Get max value of binned time-interval - sql

I have a 'requests' table with a 'time_request' column which has a timestamp for each request. I want to know the maximum number of requests that I had in a single minute.
So I'm guessing I need to somehow 'group by' a 1m time interval, and then do some sort of MAX(COUNT(request_id))? However, nested aggregations are not allowed.
I would appreciate any help.
Table example:
request_id | time_request
------------------+---------------------
ab1 | 2021-03-29 16:20:05
ab2 | 2021-03-29 16:20:20
bc3 | 2021-03-31 20:34:07
fw3 | 2021-03-31 20:38:53
fe4 | 2021-03-31 20:39:53
Expected result: 2 (There were a maximum of 2 requests in a single minute)
Thanks!

You can use the window function count and specify a logical interval of one minute as the window frame. For each row, it counts that row plus all rows whose timestamp falls within the preceding minute.
Code for Postgres is below:
with a as (
select
id
, cast(ts as timestamp) as ts
from(values
('ab1', '2021-03-29 16:20:05'),
('ab2', '2021-03-29 16:20:20'),
('bc3', '2021-03-31 20:34:07'),
('fw3', '2021-03-31 20:38:53'),
('fe4', '2021-03-31 20:39:53')
) as t(id, ts)
)
, count_per_interval as (
select
a.*
, count(id) over (
order by ts asc
range between
interval '1' minute preceding
and current row
) as cnt_per_min
from a
)
select max(cnt_per_min)
from count_per_interval
| max |
| --: |
| 2 |
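To make the frame semantics concrete, here is a minimal Python sketch of what the windowed count computes, assuming the timestamps fit in memory. The function name max_requests_per_window is made up for illustration and is not part of the answer above. Like the range frame, it counts every timestamp in the closed interval [t - 1 minute, t].

```python
from datetime import datetime, timedelta

def max_requests_per_window(timestamps, window=timedelta(minutes=1)):
    """Largest number of timestamps in [t - window, t] over all t (hypothetical helper)."""
    ordered = sorted(timestamps)
    best = 0
    start = 0  # index of the oldest timestamp still inside the window
    for end, current in enumerate(ordered):
        # Advance past timestamps that are more than `window` older than `current`.
        while current - ordered[start] > window:
            start += 1
        best = max(best, end - start + 1)
    return best

# Sample rows from the question.
requests = [
    datetime(2021, 3, 29, 16, 20, 5),
    datetime(2021, 3, 29, 16, 20, 20),
    datetime(2021, 3, 31, 20, 34, 7),
    datetime(2021, 3, 31, 20, 38, 53),
    datetime(2021, 3, 31, 20, 39, 53),
]
print(max_requests_per_window(requests))  # prints 2
```

Note that this is a rolling window, not a calendar-minute bin: the last two rows are exactly 60 seconds apart and are counted together because the frame boundary is inclusive.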

Related

How do I calculate a rolling average over a specific range timeframe in BigQuery?

I have a BigQuery table like the one below, where data wasn't necessarily recorded at a consistent rate:
| timestamp | value |
|-------------------------|-------|
| 2022-10-01 00:03:00 UTC | 2.43 |
| 2022-10-01 00:17:00 UTC | 4.56 |
| 2022-10-01 00:36:00 UTC | 3.64 |
| 2022-10-01 00:58:00 UTC | 2.15 |
| 2022-10-01 01:04:00 UTC | 2.90 |
| 2022-10-01 01:13:00 UTC | 5.88 |
... ...
I want to calculate a rolling average (as a new column) on value over a certain timeframe, e.g. the previous 12 hours. I know it's relatively simple to do over a fixed number of rows, and I've tried using LAG and TIMESTAMP_SUB functions to select the right values to average over, but I'm quite new to SQL so I'm not even sure if this is the right approach.
Does anyone know how to go about this? Thanks!
Please use a window function.
You need to calculate a date and hour column as an integer. For this we take the unix date and multiply it by 24 hours. Then we add the hours of the day. We ignore daylight saving time.
WITH
tbl AS (SELECT 10* rand() as val, timestamp_add(snapshot_date,interval cast(rand()*5000 as int64) minute) as timestamps FROM UNNEST(GENERATE_Timestamp_ARRAY("2021-01-01 00:00:00","2023-01-01 0:00:00",INTERVAL 1 hour)) AS snapshot_date)
SELECT
*,
unix_date(date(timestamps))*24+extract(hour from timestamps) as dummy_time,
avg(val) over WIN1_range as rolling_avg,
sum(1) over WIN1_range as values_in_avg
FROM
tbl
window WIN1_range as (order by unix_date(date(timestamps))*24+extract(hour from timestamps) range between 12 PRECEDING and current row)
BigQuery has simplified specifications for the range frame of window functions:
Tip: If you want to use a range with a date, use ORDER BY with the UNIX_DATE() function. If you want to use a range with a timestamp, use the UNIX_SECONDS(), UNIX_MILLIS(), or UNIX_MICROS() function.
Here, we can simply use unix_seconds() when ordering the records in the partition, and accordingly specify an interval of 12 hours as seconds:
select ts, val,
avg(val) over(
order by unix_seconds(ts)
range between 12 * 60 * 60 preceding and current row
) as avg_last_12_hours
from mytable
Now say we wanted the average over the last 2 days; we would use unix_date() instead, which takes a date, so the timestamp is converted first:
select ts, val,
avg(val) over(
order by unix_date(date(ts))
range between 2 preceding and current row
) as avg_last_2_days
from mytable
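As a rough illustration of what a time-based range frame computes, here is a small Python sketch, assuming rows arrive as (timestamp, value) pairs. The function rolling_average and the sample data are hypothetical and chosen so that the 12-hour cutoff actually drops a row.

```python
from datetime import datetime, timedelta, timezone

def rolling_average(rows, lookback=timedelta(hours=12)):
    """For each (timestamp, value) row, average the values whose timestamp
    lies in [timestamp - lookback, timestamp] (hypothetical helper)."""
    ordered = sorted(rows)
    averages = []
    start = 0  # index of the oldest row still inside the lookback window
    for end, (ts, _) in enumerate(ordered):
        while ts - ordered[start][0] > lookback:
            start += 1
        window = [value for _, value in ordered[start:end + 1]]
        averages.append((ts, sum(window) / len(window)))
    return averages

UTC = timezone.utc
# Hypothetical, irregularly spaced readings.
sample = [
    (datetime(2022, 10, 1, 0, 0, tzinfo=UTC), 2.0),
    (datetime(2022, 10, 1, 6, 0, tzinfo=UTC), 4.0),
    (datetime(2022, 10, 1, 13, 0, tzinfo=UTC), 6.0),
]
for ts, avg in rolling_average(sample):
    print(ts.isoformat(), avg)
```

The third row averages only 4.0 and 6.0, because the first reading is 13 hours old and falls outside the 12-hour frame.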

Postgres query for difference between latest and first record of the day

Postgres data like this:
| id | read_at | value_1 |
| ------|------------------------|---------|
| 16239 | 2021-11-28 16:13:00+00 | 1509 |
| 16238 | 2021-11-28 16:12:00+00 | 1506 |
| 16237 | 2021-11-28 16:11:00+00 | 1505 |
| 16236 | 2021-11-28 16:10:00+00 | 1501 |
| 16235 | 2021-11-28 16:09:00+00 | 1501 |
| ..... | .......................| .... |
| 15266 | 2021-11-28 00:00:00+00 | 1288 |
A value is added every minute and increases over time.
I would like to get the current total for the day and have this in a Grafana stat panel. Above it would be: 221 (1509-1288). Latest record minus first record of today.
SELECT id,read_at,value_1
FROM xyz
ORDER BY id DESC
LIMIT 1;
With this the latest record is given (A).
SELECT id,read_at,value_1
FROM xyz
WHERE read_at = CURRENT_DATE
ORDER BY id DESC
LIMIT 1;
With this the first record of the day is given (B).
Grafana cannot do math on this (A-B). Single query would be best.
Sadly my database knowledge is low and attempts at building queries have not succeeded, and have taken all afternoon now.
Theoretical ideas to solve this:
Subtract the min from the max value where time frame is today.
Using a lag, lag it for the count of records that are recorded today. Subtract lag value from latest value.
Window function.
What is the best way (performance wise) forward and how would such query be written?
Calculate the cumulative total last_value - first_value for each record for the current day using window functions (this is the t subquery) and then pick the latest one.
select current_total, read_at::date as read_at_date
from
(
select last_value(value_1) over w - first_value(value_1) over w as current_total,
read_at
from the_table
where read_at >= current_date and read_at < current_date + 1
window w as (partition by read_at::date order by read_at)
) as t
order by read_at desc limit 1;
However, if it is certain that value_1 only "increases over time", then simple grouping will do, and that is by far the best way performance-wise:
select max(value_1) - min(value_1) as current_total,
read_at::date as read_at_date
from the_table
where read_at >= current_date and read_at < current_date + 1
group by read_at::date;
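The difference between the two queries only shows up when the values are not monotonic. A minimal Python sketch, assuming one day's rows as (read_at, value_1) pairs; the function day_totals and the second data set are made up for illustration:

```python
from datetime import datetime

def day_totals(readings):
    """Return (last - first, max - min) for one day's (read_at, value) rows."""
    ordered = sorted(readings)  # tuples sort by read_at first
    values = [value for _, value in ordered]
    return values[-1] - values[0], max(values) - min(values)

# Rows from the question: the counter only increases.
increasing = [
    (datetime(2021, 11, 28, 0, 0), 1288),
    (datetime(2021, 11, 28, 16, 9), 1501),
    (datetime(2021, 11, 28, 16, 10), 1501),
    (datetime(2021, 11, 28, 16, 13), 1509),
]
# Hypothetical day on which the counter dropped.
with_drop = [
    (datetime(2021, 11, 29, 0, 0), 100),
    (datetime(2021, 11, 29, 8, 0), 150),
    (datetime(2021, 11, 29, 16, 0), 20),
]
print(day_totals(increasing))  # prints (221, 221)
print(day_totals(with_drop))   # prints (-80, 130)
```

As long as the values never decrease, both approaches agree, which is why the cheaper max - min grouping is enough in that case.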
Please check if it works.
Since you intend to publish it in Grafana, the query does not impose a period filter.
https://www.db-fiddle.com/f/4jyoMCicNSZpjMt4jFYoz5/3080
create table g (id int, read_at timestamp, value_1 int);
insert into g
values
(16239, '2021-11-28 16:13:00+00', 1509),
(16238, '2021-11-28 16:12:00+00', 1506),
(16237, '2021-11-28 16:11:00+00', 1505),
(16236, '2021-11-28 16:10:00+00', 1501),
(16235, '2021-11-28 16:09:00+00', 1501),
(15266, '2021-11-28 00:00:00+00', 1288);
select date(read_at), max(value_1) - min(value_1)
from g
group by date(read_at);
Since your data contains the same value at 2 distinct times (16:09 and 16:10), the values do not strictly increase over the interval, which leaves open the possibility of a decrease. So do you want the max reading minus the min reading, or the difference between the readings at the max and min times? The following gets the difference between the first and latest reading of the day, as indicated in the title.
with parm(dt) as
( values (date '2021-11-28') )
, first_read (f_read,f_value) as
( select read_at, value_1
from test_tbl
where read_at at time zone 'UTC'=
( select min(read_at at time zone 'UTC')
from test_tbl
join parm
on ((read_at at time zone 'UTC')::date = dt)
)
)
, last_read (l_read, l_value) as
( select read_at,value_1
from test_tbl
where read_at at time zone 'UTC'=
( select max(read_at at time zone 'UTC')
from test_tbl
join parm
on ((read_at at time zone 'UTC')::date = dt)
)
)
select l_read, f_read, l_value, f_value, l_value - f_value as "Day Difference"
from last_read
join first_read on true;

Get a rolling count of timestamps in SQL

I have a table (in an Oracle DB) that looks something like what is shown below with about 4000 records. This is just an example of how the table is designed. The timestamps span several years.
| Time | Action |
| 9/25/2019 4:24:32 PM | Yes |
| 9/25/2019 4:28:56 PM | No |
| 9/28/2019 7:48:16 PM | Yes |
| .... | .... |
I want to be able to get a count of timestamps that occur on a rolling 15 minute interval. My main goal is to identify the maximum number of timestamps that appear for any 15 minute interval. I would like this done by looking at each timestamp and getting a count of timestamps that appear within 15 minutes of that timestamp.
My goal would be to have something like
| Interval | Count |
| 9/25/2019 4:24:00 PM - 9/25/2019 4:39:00 | 2 |
| 9/25/2019 4:25:00 PM - 9/25/2019 4:40:00 | 2 |
| ..... | ..... |
| 9/25/2019 4:39:00 PM - 9/25/2019 4:54:00 | 0 |
I am not sure how I would be able to do this, if at all. Any ideas or advice would be much appreciated.
If you want any 15 minute interval in the data, then you can use:
select t.*,
count(*) over (order by timestamp
range between interval '15' minute preceding and current row
) as cnt_15
from t;
If you want the maximum, then use rank() on this:
select t.*
from (select t.*, rank() over (order by cnt_15 desc) as seqnum
from (select t.*,
count(*) over (order by timestamp
range between interval '15' minute preceding and current row
) as cnt_15
from t
) t
) t
where seqnum = 1;
This doesn't produce exactly the results you specify in the question. But it does answer the question:
I want to be able to get a count of timestamps that occur on a rolling 15 minute interval. My main goal is to identify the maximum number of timestamps that appear for any 15 minute interval.
You could enumerate the minutes with a recursive query, then bring the table with a left join:
with recursive cte (start_dt, max_dt) as (
select trunc(min(time), 'mi'), max(time) from mytable
union all
select start_dt + interval '1' minute, max_dt from cte where start_dt < max_dt
)
select
c.start_dt,
c.start_dt + interval '15' minute end_dt,
count(t.time) cnt
from cte c
left join mytable t
on t.time >= c.start_dt
and t.time < c.start_dt + interval '15' minute
group by c.start_dt

Moving average last 30 days

I want to find the number of unique users active in the last 30 days. I want to calculate this for today, but also for days in the past. The dataset contains user ids, dates and events triggered by the user, saved in BigQuery. A user counts as active when they open the mobile app, which triggers the event session_start. Example of the unnested dataset:
| resettable_device_id | date | event |
------------------------------------------------------
| xx | 2017-06-09 | session_start |
| yy | 2017-06-09 | session_start |
| xx | 2017-06-11 | session_start |
| zz | 2017-06-11 | session_start |
I found a solution which suits my problem:
BigQuery: how to group and count rows within rolling timestamp window?
My BigQuery script so far:
#standardSQL
WITH daily_aggregation AS (
SELECT
PARSE_DATE("%Y%m%d", event_dim.date) AS day,
COUNT(DISTINCT user_dim.device_info.resettable_device_id) AS unique_resettable_device_ids
FROM `ANDROID.app_events_*`,
UNNEST(event_dim) AS event_dim
WHERE event_dim.name = "session_start"
GROUP BY day
)
SELECT
day,
unique_resettable_device_ids,
SUM(unique_resettable_device_ids)
OVER(ORDER BY UNIX_SECONDS(TIMESTAMP(day)) DESC ROWS BETWEEN 2592000 PRECEDING AND CURRENT ROW) AS unique_ids_rolling_30_days
FROM daily_aggregation
ORDER BY day
This script results in the following table:
| day | unique_resettable_device_ids | unique_ids_rolling_30_days |
------------------------------------------------------------------------
| 2018-06-05 | 1807 | 2614 |
| 2018-06-06 | 711 | 807 |
| 2018-06-07 | 96 | 96 |
The problem is that the column unique_ids_rolling_30_days is just a cumulative sum of the column unique_resettable_device_ids. How can I fix the rolling window function in my script?
"The problem is that the column unique_ids_rolling_30_days is just a cumulative sum of the column unique_resettable_device_ids."
Of course, as that's exactly what the code
SUM(unique_resettable_device_ids)
OVER(ORDER BY UNIX_SECONDS(TIMESTAMP(day)) DESC ROWS BETWEEN 2592000 PRECEDING AND CURRENT ROW) AS unique_ids_rolling_30_days
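is asking for: ROWS counts physical rows rather than seconds, so a frame of 2592000 preceding rows covers every row that sorts before the current one.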
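A small Python sketch of the two frame types may help here. The helpers rows_frame_sum and range_frame_sum are hypothetical stand-ins for SUM(...) OVER with a ROWS or a RANGE frame, and the day numbers in the second example are invented.

```python
def rows_frame_sum(values, preceding):
    """Mimic SUM(v) OVER (ROWS BETWEEN preceding PRECEDING AND CURRENT ROW):
    the frame is counted in rows, whatever the ORDER BY key is."""
    return [sum(values[max(0, i - preceding): i + 1]) for i in range(len(values))]

def range_frame_sum(keys, values, preceding):
    """Mimic SUM(v) OVER (ORDER BY key RANGE BETWEEN preceding PRECEDING
    AND CURRENT ROW) for ascending numeric keys: the frame is a key interval."""
    return [
        sum(v for k, v in zip(keys, values) if key - preceding <= k <= key)
        for key in keys
    ]

# Daily counts from the question, in the DESC order the query sorts them.
daily_desc = [96, 711, 1807]
print(rows_frame_sum(daily_desc, 2592000))  # prints [96, 807, 2614]

# Hypothetical day numbers (as UNIX_DATE would return) with a 30-day range.
day_numbers = [0, 1, 2, 40]
counts = [10, 20, 30, 40]
print(range_frame_sum(day_numbers, counts, 30))  # prints [10, 30, 60, 40]
```

The first result reproduces the cumulative column from the question. Even with a correct RANGE frame, summing daily distinct counts would count a returning user once per day, so it is still not a count of unique users over 30 days.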
Check out https://stackoverflow.com/a/49866033/132438 where the question asks specifically about counting uniques in a rolling window. It turns out to be a very slow operation, given how much memory it requires.
The solution for this when you want a rolling count of uniques: Go for approximate results.
From the linked answer:
#standardSQL
SELECT DATE_SUB(date, INTERVAL i DAY) date_grp
, HLL_COUNT.MERGE(sketch) unique_90_day_users
, HLL_COUNT.MERGE(DISTINCT IF(i<31,sketch,null)) unique_30_day_users
, HLL_COUNT.MERGE(DISTINCT IF(i<8,sketch,null)) unique_7_day_users
, COUNT(*) window_days
FROM (
SELECT DATE(creation_date) date, HLL_COUNT.INIT(owner_user_id) sketch
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE EXTRACT(YEAR FROM creation_date)=2017
GROUP BY 1
), UNNEST(GENERATE_ARRAY(1, 90)) i
GROUP BY 1
HAVING window_days=90
ORDER BY date_grp
Working solution for a weekly calculation of the number of active users in the last 30 days.
#standardSQL
WITH days AS (
SELECT day
FROM UNNEST(GENERATE_DATE_ARRAY('2018-01-01', CURRENT_DATE(), INTERVAL 1 WEEK)) AS day
), periods AS (
SELECT
DATE_SUB(days.day, INTERVAL 30 DAY) AS StartDate,
days.day AS EndDate FROM days
)
SELECT
periods.EndDate AS Day,
COUNT(DISTINCT user_dim.device_info.resettable_device_id) as resettable_device_ids
FROM `ANDROID.app_events_*`,
UNNEST(event_dim) AS event_dim
CROSS JOIN periods
WHERE
PARSE_DATE("%Y%m%d", event_dim.date) BETWEEN periods.StartDate AND periods.EndDate
AND event_dim.name = "session_start"
GROUP BY Day
ORDER BY Day DESC

SQL grouping by datetime with a maximum difference of x minutes

I have a problem with grouping my dataset in MS SQL Server.
My table looks like
# | CustomerID | SalesDate | Turnover
---| ---------- | ------------------- | ---------
1 | 1 | 2016-08-09 12:15:00 | 22.50
2 | 1 | 2016-08-09 12:17:00 | 10.00
3 | 1 | 2016-08-09 12:58:00 | 12.00
4 | 1 | 2016-08-09 13:01:00 | 55.00
5 | 1 | 2016-08-09 23:59:00 | 10.00
6 | 1 | 2016-08-10 00:02:00 | 5.00
Now I want to group the rows where the SalesDate difference to the next row is at most 5 minutes.
So that row 1 & 2, 3 & 4 and 5 & 6 are each one group.
My approach was getting the minutes with the DATEPART() function and divide the result by 5:
(DATEPART(MINUTE, SalesDate) / 5)
For row 1 and 2 the result would be 3 and grouping here would work perfectly.
But for the other rows where there is a change in the hour or even in the day part of the SalesDate, the result cannot be used for grouping.
So this is where I'm stuck. I would really appreciate it if someone could point me in the right direction.
You want to group adjacent transactions based on the timing between them. The idea is to assign some sort of grouping identifier, and then use that for aggregation.
Here is an approach:
Identify group starts using lag() and date arithmetic.
Do a cumulative sum of the group starts to identify each group.
Aggregate
The query looks like this:
select customerid, min(salesdate), max(salesdate), sum(turnover)
from (select t.*,
sum(case when salesdate > dateadd(minute, 5, prev_salesdate)
then 1 else 0
end) over (partition by customerid order by salesdate) as grp
from (select t.*,
lag(salesdate) over (partition by customerid order by salesdate) as prev_salesdate
from t
) t
) t
group by customerid, grp;
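The three steps above (flag group starts, number the groups, aggregate) can be sketched in Python as follows. This is only an illustration for a single customer; the function group_by_gap is hypothetical, and the query's partition by customerid is left out.

```python
from datetime import datetime, timedelta

def group_by_gap(rows, max_gap=timedelta(minutes=5)):
    """Split (sales_date, turnover) rows into runs of adjacent rows that are
    at most `max_gap` apart; return (first_date, last_date, total) per run."""
    groups = []
    previous = None
    for sales_date, turnover in sorted(rows):
        # A new group starts when the gap to the previous row exceeds max_gap.
        if previous is None or sales_date > previous + max_gap:
            groups.append([sales_date, sales_date, 0.0])
        group = groups[-1]
        group[1] = sales_date
        group[2] += turnover
        previous = sales_date
    return [tuple(group) for group in groups]

# Rows for CustomerID 1 from the question.
sales = [
    (datetime(2016, 8, 9, 12, 15), 22.50),
    (datetime(2016, 8, 9, 12, 17), 10.00),
    (datetime(2016, 8, 9, 12, 58), 12.00),
    (datetime(2016, 8, 9, 13, 1), 55.00),
    (datetime(2016, 8, 9, 23, 59), 10.00),
    (datetime(2016, 8, 10, 0, 2), 5.00),
]
for first, last, total in group_by_gap(sales):
    print(first, last, total)
```

Because the comparison uses full datetimes, the group that crosses the hour (12:58 and 13:01) and the one that crosses midnight (23:59 and 00:02) stay together, which is where the DATEPART approach breaks down.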
EDIT
Thanks to @JoeFarrell for pointing out I have answered the wrong question. The OP is looking for dynamic time differences between rows, but this approach creates fixed boundaries.
Original Answer
You could create a time table. This is a table that contains one record for each second of the day. Your table would have a second column that you can use to perform group bys on.
CREATE TABLE [Time]
(
TimeId TIME(0) PRIMARY KEY,
TimeGroup TIME
)
;
-- You could use a loop here instead.
INSERT INTO [Time]
(
TimeId,
TimeGroup
)
VALUES
('00:00:00', '00:00:00'), -- First group starts here.
('00:00:01', '00:00:00'),
('00:00:02', '00:00:00'),
('00:00:03', '00:00:00'),
...
('00:04:59', '00:00:00'),
('00:05:00', '00:05:00'), -- Second group starts here.
('00:05:01', '00:05:00')
;
The approach works best when:
You need to reuse your custom grouping in several different queries.
You have two or more custom groups you often use.
Once populated you can simply join to the table and output the desired result.
/* Using the time table.
*/
SELECT
t.TimeGroup,
SUM(Turnover) AS SumOfTurnover
FROM
Sales AS s
INNER JOIN [Time] AS t ON t.TimeId = CAST(s.SalesDate AS Time(0))
GROUP BY
t.TimeGroup
;
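For comparison, the TimeGroup lookup amounts to flooring the time of day to a fixed 5-minute boundary. A minimal Python sketch, assuming 5-minute buckets as in the table above; the function time_group is made up for illustration:

```python
from datetime import datetime

def time_group(moment, minutes=5):
    """Floor a datetime's time of day to the start of its fixed bucket."""
    floored_minute = moment.minute - moment.minute % minutes
    return moment.replace(minute=floored_minute, second=0, microsecond=0).time()

print(time_group(datetime(2016, 8, 9, 12, 15)))  # prints 12:15:00
print(time_group(datetime(2016, 8, 9, 12, 17)))  # prints 12:15:00
print(time_group(datetime(2016, 8, 9, 12, 58)))  # prints 12:55:00
print(time_group(datetime(2016, 8, 9, 13, 1)))   # prints 13:00:00
```

Rows 3 and 4 from the question land in different buckets even though they are only 3 minutes apart, which is the fixed-boundary limitation mentioned in the EDIT.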