Best way to count rows by arbitrary time intervals - sql

My app has an Events table with time-stamped events.
I need to report the count of events during each of the most recent N time intervals. For different reports, the interval could be "each week" or "each day" or "each hour" or "each 15-minute interval".
For example, a user can display how many orders they received each week, day, hour, or quarter-hour.
1) My preference is to dynamically do a single SQL query (I'm using Postgres) that groups by an arbitrary time interval. Is there a way to do that?
2) An easy but ugly brute force way is to do a single query for all records within the start/end timeframe sorted by timestamp, then have a method manually build a tally by whatever interval.
3) Another approach would be to add separate fields to the event table for each interval and statically store the_week, the_day, the_hour, and the_quarter_hour values, so I take the 'hit' once, when the record is created, instead of every time I report on that field.
What's best practice here, given I could modify the model and pre-store interval data if required (although at the modest expense of doubling the table width)?

Luckily, you are using PostgreSQL. The set-returning function generate_series() is your friend.
Test case
Given the following test table (which you should have provided):
CREATE TABLE event(event_id serial, ts timestamp);
INSERT INTO event (ts)
SELECT generate_series(timestamp '2018-05-01'
                     , timestamp '2018-05-08'
                     , interval '7 min') + random() * interval '7 min';
One event for every 7 minutes (plus 0 to 7 minutes, randomly).
Basic solution
This query counts events for any arbitrary time interval. 17 minutes in the example:
WITH grid AS (
   SELECT start_time
        , lead(start_time, 1, 'infinity') OVER (ORDER BY start_time) AS end_time
   FROM  (
      SELECT generate_series(min(ts), max(ts), interval '17 min') AS start_time
      FROM   event
      ) sub
   )
SELECT start_time, count(e.ts) AS events
FROM   grid       g
LEFT   JOIN event e ON e.ts >= g.start_time
                   AND e.ts <  g.end_time
GROUP  BY start_time
ORDER  BY start_time;
The query retrieves minimum and maximum ts from the base table to cover the complete time range. You can use an arbitrary time range instead.
Provide any time interval as needed.
Produces one row for every time slot. If no event happened during that interval, the count is 0.
Be sure to handle upper and lower bound correctly. See:
Unexpected results from SQL query with BETWEEN timestamps
The window function lead() has an often overlooked feature: it can provide a default for when no leading row exists. The example provides 'infinity'. Otherwise, the last interval would be cut off with an upper bound of NULL.
Minimal equivalent
The above query uses a CTE and lead() and verbose syntax. Elegant and maybe easier to understand, but a bit more expensive. Here is a shorter, faster, minimal version:
SELECT start_time, count(e.ts) AS events
FROM  (SELECT generate_series(min(ts), max(ts), interval '17 min') FROM event) g(start_time)
LEFT   JOIN event e ON e.ts >= g.start_time
                   AND e.ts <  g.start_time + interval '17 min'
GROUP  BY 1
ORDER  BY 1;
Example for "every 15 minutes in the past week"
Formatted with to_char().
SELECT to_char(start_time, 'YYYY-MM-DD HH24:MI'), count(e.ts) AS events
FROM   generate_series(date_trunc('day', localtimestamp - interval '7 days')
                     , localtimestamp
                     , interval '15 min') g(start_time)
LEFT   JOIN event e ON e.ts >= g.start_time
                   AND e.ts <  g.start_time + interval '15 min'
GROUP  BY start_time
ORDER  BY start_time;
Still ORDER BY and GROUP BY on the underlying timestamp value, not on the formatted string. That's faster and more reliable.
Related answer producing a running count over the time frame:
PostgreSQL: running count of rows for a query 'by minute'
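To pass the interval in at run time instead of editing the literal, one option is a prepared statement. This is a minimal sketch, assuming the event table from the test case above; the statement name event_counts and the parameter order (interval, start, end) are made up for illustration.

```sql
-- $1 = slot width, $2 = start of time frame, $3 = end of time frame
PREPARE event_counts(interval, timestamp, timestamp) AS
SELECT g.start_time, count(e.ts) AS events
FROM   generate_series($2, $3, $1) g(start_time)
LEFT   JOIN event e ON e.ts >= g.start_time
                   AND e.ts <  g.start_time + $1
GROUP  BY 1
ORDER  BY 1;

-- Hourly counts for the first day of the test data:
EXECUTE event_counts('1 hour', '2018-05-01', '2018-05-02');
```

Note that generate_series() includes the upper bound when it falls exactly on the grid, so the call above also returns a slot starting at 2018-05-02 00:00. Most client libraries offer the same parameter binding without an explicit PREPARE.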

Related

Group by arbitrary interval

I have a column of type timestamp. I would like to dynamically group the results by an arbitrary period of time (it can be 10 seconds or even 5 hours).
Supposing, I have this kind of data:
[Image: sample data, not preserved]
If the user provides 2 hours and wants to get the max value of the air_pressure, I would like to have the first row combined with the second one. The result should look like this:
date | max air_pressure
2022-11-22 00:00:00:000 | 978.81666667
2022-11-22 02:00:00:000 | 978.53
2022-11-22 04:00:00:000 | 987.23333333
and so on. As I mentioned, the period must be easy to change, because the user may want to group by days, seconds, and so on.
The functionality should work like function date_trunc(). But that can only group by minutes/seconds/hours, while I would like to group for arbitrary intervals.
Basically:
SELECT g.start_time, max(air_pressure) AS max_air_pressure
FROM   generate_series($start
                     , $end
                     , interval '15 min') g(start_time)
LEFT   JOIN tbl t ON t.date_id >= g.start_time
                 AND t.date_id <  g.start_time + interval '15 min'  -- same interval
GROUP  BY 1
ORDER  BY 1;
$start and $end are timestamps delimiting your time frame of interest.
Returns all time slots, and NULL for max_air_pressure if no matching entries are found for the time slot.
See:
Best way to count rows by arbitrary time intervals
Aside: "date_id" is an unfortunate column name for a timestamp.
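If time slots without any rows may be omitted from the result, PostgreSQL 14 and later also provide date_bin(), which behaves like date_trunc() with an arbitrary stride. A minimal sketch using the table and column names from the query above; the origin timestamp '2022-11-22' is an assumption chosen so that the 2-hour slots start at midnight.

```sql
-- Requires PostgreSQL 14 or later.
-- Unlike the generate_series() variant, empty slots produce no row at all.
SELECT date_bin(interval '2 hours', date_id, timestamp '2022-11-22') AS start_time
     , max(air_pressure) AS max_air_pressure
FROM   tbl
GROUP  BY 1
ORDER  BY 1;
```

The stride passed to date_bin() cannot contain months or years, so calendar units like those still call for date_trunc().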

Averaging a variable over a period of time

I am currently having difficulty formulating this into an sql query:
I would like to average the values of a column (here twa) over a duration of 10 minutes, counting back from the last value in the table, i.e. the rows where:
last date-10minutes<=date<=last date
I tried to start a first query but it does not show the right answer:
SELECT AVG(twa), horaire FROM OF50 WHERE ((SELECT horaire FROM of50 ORDER BY horaire DESC LIMIT 1)-INTERVAL '1 minutes'>horaire) ORDER BY horaire;
Maybe this will do.
with t as (select max(horaire) maxhoraire from of50)
select AVG(of50.twa)
from of50, t
where of50.horaire between t.maxhoraire - interval '1 minute' and t.maxhoraire;
Or even this may do, given that the last value cannot be 'younger' than now and at least one event happened during the last minute. It is not exactly the same, though: it computes the average over the last minute before now.
select AVG(twa)
from of50
where horaire >= now() - interval '1 minute';
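The question asks for 10 minutes, while the queries above use 1 minute, following the asker's attempt. A minimal sketch of the 10-minute variant, assuming horaire is a timestamp column; no upper bound is needed because no row can be later than max(horaire).

```sql
-- Average twa over the 10 minutes leading up to the latest row in the table.
SELECT avg(twa) AS avg_twa
FROM   of50
WHERE  horaire >= (SELECT max(horaire) FROM of50) - interval '10 minutes';
```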

Listing the hours between two timestamps and grouping by those hours

I am trying to get a count of the couriers that are active during every hour of a shift, using the start and end times of their shifts to create an array which I hope to group by. First, when I run it I get epoch times back; second, I am not able to group by the hours array.
Does anyone have any solutions that they would kindly share with me?
SELECT
GENERATE_TIMESTAMP_ARRAY(CAST(fss.start_time_local AS TIMESTAMP), CAST(fss.end_time_local AS TIMESTAMP) , INTERVAL 1 hour) as hours,
#COUNT(sys_scheduled_shift_id) AS number_schedule_shift,
FROM just-data-warehouse.delco_analytics_team_dwh.fact_scheduled_shifts AS fss
#GROUP BY hours
For your reference the shift data for the courier is structured like so
To calculate how many couriers have been active at least one minute in every hour I would do it like this:
SELECT
  CALENDAR.datetime
  ,SUM(workers.flag_worker) as n_workers
FROM (
  -- CALENDAR
  SELECT
    cast(datetime as datetime) datetime
  FROM UNNEST(GENERATE_TIMESTAMP_ARRAY('2022-01-01T00:00:00', '2022-01-02T00:00:00'
                                      ,INTERVAL 1 hour)) AS datetime
) CALENDAR
-- TABLE of SHIFTS
LEFT JOIN (
  SELECT * , 1 flag_worker FROM
  UNNEST(
    ARRAY<STRUCT<worker_id string , shift_start datetime, shift_end datetime>>[
      ('Worker_01', '2022-01-01T06:00:00','2022-01-01T14:00:00')
      ,('Worker_02', '2022-01-01T10:00:00','2022-01-01T18:00:00')
    ]
  )
  AS workers
)workers
ON CALENDAR.datetime < workers.shift_end
AND DATETIME_ADD(CALENDAR.datetime, INTERVAL 1 hour) > workers.shift_start
GROUP BY CALENDAR.datetime
The idea is to build a calendar of datetimes and then join it with a table of shifts.
Instead of hours, the calendar can be modified to have fractions of hours. Also, there may be a more elegant way to build the calendar.

Getting records from the past x weeks

I've been fighting with an issue about querying the records where created_at is within the current, or past x weeks. Say, today is Wednesday, so that'd be from Monday at midnight, up to now, if x = 1. If x > 1, I'm looking for current week, up to today, or past week, but not using regular interval '1 week' as that'll get me Wednesday to Wednesday, and I'm only looking into "whole" weeks.
I've tried the interval-solution, and also things like WHERE created_at > (CURRENT_DATE - INTERVAL '5 week').
A solution that'll work for both day, month, year etc, would be preferred, as I'm actually building the query through some other backend logic.
I'm looking for a generic query for "Find everything that's been created 'x periods' back".
Edit:
Since last time, I've implemented this in my Ruby on Rails application. This has caused some problems when using HOUR. The built query works for everything but HOUR (that is, for MONTH, DAY, and YEAR):
SELECT "customer_uses".*
FROM "customer_uses"
WHERE (customer_uses.created_at > DATE_TRUNC('MONTH', TIMESTAMP '2017-09-17T16:45:01+02:00') - INTERVAL '1 MONTH')
Which works correctly on my test cases. (Checking count of this). The TIMESTAMP is generated by DateTime.now to ensure my test-cases working with a time-override for "time-travelling"-tests, therefore not using the built in function.
(I've stripped away some extra WHERE-calls which should be irrelevant).
Why HOUR isn't working is a mystery to me, as I'm using it with an interpolated string for HOUR instead of MONTH as above, like so:
SELECT "customer_uses".*
FROM "customer_uses"
WHERE (customer_uses.created_at > DATE_TRUNC('HOUR', TIMESTAMP '2017-09-17T16:45:21+02:00') - INTERVAL '1 HOUR')
Your current suggested query is almost right, except that it uses the current date instead of the start of the week:
SELECT * FROM your_table WHERE created_at > (CURRENT_DATE - INTERVAL '5 week')
Instead, we can check 5 week intervals backwards, but from the start of the current week:
SELECT *
FROM your_table
WHERE created_at > DATE_TRUNC('week', CURRENT_DATE) - INTERVAL '5 week';
I believe you should be able to use the above query as a template for other time periods, e.g. a certain number of months. Just replace the unit in DATE_TRUNC and in the interval of the WHERE clause.
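One possible contributor to the HOUR problem in the question, offered as a guess rather than a confirmed diagnosis: PostgreSQL silently ignores a UTC offset inside a plain TIMESTAMP literal. With MONTH, DAY, or YEAR truncation a shift of two hours rarely changes the result, but with HOUR it can. Whether this applies depends on the type of created_at and on the session time zone, neither of which the question states.

```sql
-- The "+02:00" offset in a TIMESTAMP (without time zone) literal is discarded:
SELECT TIMESTAMP '2017-09-17T16:45:21+02:00';   -- 2017-09-17 16:45:21

-- A TIMESTAMPTZ literal honors the offset; date_trunc() then truncates
-- according to the session's TimeZone setting:
SELECT date_trunc('hour', TIMESTAMPTZ '2017-09-17T16:45:21+02:00') - interval '1 hour';
```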

Displaying all tasks with different hours for current day

I have a table with tasks for specified users.
The columns are:
TASK_ID
TASK_NAME
TIME_FROM
TIME_TO
USER_id
I want to create a report that will show tasks for current day, sorted by start time.
But I don't know how to compare the task days with sysdate without also comparing the hours. I want the report to show tasks for all hours of the current day. How can I do that?
Assuming that "tasks for current day" means any task with a time_from of some time today and that time_from and time_to are both date data types
SELECT *
FROM your_table
WHERE trunc(time_from) = trunc(sysdate)
ORDER BY time_from
trunc( <<date>> ) returns <<date>> at midnight. So trunc(sysdate) returns today at midnight. And trunc(time_from) will equal trunc(sysdate) for any date whose day component is today regardless of the time.
From a performance standpoint, a function-based index on trunc(time_from) would likely be beneficial.
Alternatively, if time_from is indexed, it may be beneficial to do this instead
SELECT *
FROM your_table
WHERE time_from >= trunc(sysdate)
AND time_from < trunc(sysdate+1)
ORDER BY time_from
so that Oracle can do an index scan on time_from.
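The function-based index mentioned above could be created roughly as follows. This is a sketch in Oracle syntax; the index name is made up for illustration, and time_from is the start-time column from the question's table.

```sql
-- Function-based index so that the predicate trunc(time_from) = trunc(sysdate)
-- can be answered from the index (hypothetical index name).
CREATE INDEX your_table_time_from_day_ix
    ON your_table (TRUNC(time_from));
```

The range-predicate variant needs no such index; a plain index on time_from is enough for it.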