The Average Number of Rides Completed in 4 Hours - sql

I have a dataset with each ride having its own ride_id and its completion time. I want to know how many rides happen every 4 hours, on average.
Sample Dataset:
dropoff_datetime ride_id
2022-08-27 11:42:02 1715
2022-08-24 05:59:26 1713
2022-08-23 17:40:05 1716
2022-08-28 23:06:01 1715
2022-08-27 03:21:29 1714
For example, I would like to find out how many rides happened between 2022-08-27 12 PM and 2022-08-27 4 PM, and then how many rides happened in the next 4-hour period, from 2022-08-27 4 PM to 2022-08-27 8 PM.
What I've tried:
I first truncate my dropoff_datetime to the hour (DATE_TRUNC).
I then group by that hour to get the count of rides per hour.
Example Query:
Note: calling the above table - final.
SELECT DATE_TRUNC('hour', dropoff_datetime) as by_hour
,count(ride_id) as total_rides
FROM final
WHERE 1=1
GROUP BY 1
Result:
by_hour total_rides
2022-08-27 4:00:00 3756
2022-08-27 5:00:00 6710
My question is:
How can I make it so it's grouping every 4 hours instead?

The question actually consists of two parts: how to generate the date ranges and how to aggregate the data. One possible approach is to use the minimum and maximum dates in the data to generate the ranges and then join with the data again:
-- sample data
with dataset (dropoff_datetime, ride_id) AS (
    VALUES (timestamp '2022-08-24 11:42:02', 1715),
           (timestamp '2022-08-24 05:59:26', 1713),
           (timestamp '2022-08-24 05:29:26', 1712),
           (timestamp '2022-08-23 17:40:05', 1716)),
-- query part
min_max as (
    select min(date_trunc('hour', dropoff_datetime)) d_min,
           max(date_trunc('hour', dropoff_datetime)) d_max
    from dataset
),
date_ranges as (
    select h
    from min_max,
         unnest (sequence(d_min, d_max, interval '4' hour)) t(h)
)
select h, count_if(ride_id is not null)
from date_ranges
left join dataset on dropoff_datetime between h and h + interval '4' hour
group by h
order by h;
This will produce the following output:
h                    _col1
2022-08-23 17:00:00  1
2022-08-23 21:00:00  0
2022-08-24 01:00:00  0
2022-08-24 05:00:00  2
2022-08-24 09:00:00  1
Note that this can be quite performance-intensive for large amounts of data.
Another approach is to pick a "reference point" and start counting from it, for example the minimum date in the dataset:
-- sample data
with dataset (dropoff_datetime, ride_id) AS (
    VALUES (timestamp '2022-08-27 11:42:02', 1715),
           (timestamp '2022-08-24 05:59:26', 1713),
           (timestamp '2022-08-24 05:29:26', 1712),
           (timestamp '2022-08-23 17:40:05', 1716),
           (timestamp '2022-08-28 23:06:01', 1715),
           (timestamp '2022-08-27 03:21:29', 1714)),
-- query part
base_with_curr AS (
    select (select min(date_trunc('hour', dropoff_datetime)) from dataset) base,
           date_trunc('hour', dropoff_datetime) dropoff_datetime
    from dataset)
select date_add('hour', (date_diff('hour', base, dropoff_datetime) / 4) * 4, base) as four_hour,
       count(*)
from base_with_curr
group by 1;
Output:
four_hour            _col1
2022-08-23 17:00:00  1
2022-08-28 21:00:00  1
2022-08-24 05:00:00  2
2022-08-27 09:00:00  1
2022-08-27 01:00:00  1
Then you can use sequence approach to generate missing dates if needed.
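Since the original question asks for the average number of rides per 4-hour window, one possible final step (a sketch, reusing the bucketing from the query above) is to wrap the grouped counts and average them, e.g. by replacing the final SELECT of that query with:
select avg(rides_per_bucket) as avg_rides_per_4h
from (
    select date_add('hour', (date_diff('hour', base, dropoff_datetime) / 4) * 4, base) as four_hour,
           count(*) as rides_per_bucket
    from base_with_curr
    group by 1
) t;
Keep in mind this averages only over buckets that actually contain rides; if empty buckets should count as zero, generate them with the sequence approach first.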

Related

Count events with a cool-down period after each instance

In a Postgres DB I have entries for "events", associated with an id, and when they happened. I need to count them with a special rule.
When an event happens the counter is incremented and for the next 14 days all events of this type are not counted.
Example:
event  created_at        blockdate  action
16     2021-11-11 11:15  25.11.21   count
16     2021-11-11 11:15  25.11.21   block
16     2021-11-13 10:45  25.11.21   block
16     2021-11-16 10:40  25.11.21   block
16     2021-11-23 11:15  25.11.21   block
16     2021-11-23 11:15  25.11.21   block
16     2021-12-10 13:00  24.12.21   count
16     2021-12-15 13:25  24.12.21   block
16     2021-12-15 13:25  24.12.21   block
16     2021-12-15 13:25  24.12.21   block
16     2021-12-20 13:15  24.12.21   block
16     2021-12-23 13:15  24.12.21   block
16     2021-12-31 13:25  14.01.22   count
16     2022-02-05 15:00  19.02.22   count
16     2022-02-05 15:00  19.02.22   block
16     2022-02-13 17:15  19.02.22   block
16     2022-02-21 10:09  07.03.22   count
43     2021-11-26 11:00  10.12.21   count
43     2022-01-01 15:00  15.01.22   count
43     2022-04-13 10:07  27.04.22   count
43     2022-04-13 10:09  27.04.22   block
43     2022-04-13 10:09  27.04.22   block
43     2022-04-13 10:09  27.04.22   block
43     2022-04-13 10:10  27.04.22   block
43     2022-04-13 10:10  27.04.22   block
43     2022-04-13 10:47  27.04.22   block
43     2022-05-11 20:25  25.05.22   count
75     2021-10-21 12:50  04.11.21   count
75     2021-11-02 12:50  04.11.21   block
75     2021-11-18 11:15  02.12.21   count
75     2021-11-18 12:55  02.12.21   block
75     2021-11-18 16:35  02.12.21   block
75     2021-11-24 11:00  02.12.21   block
75     2021-12-01 11:00  02.12.21   block
75     2021-12-14 13:25  28.12.21   count
75     2021-12-15 13:35  28.12.21   block
75     2021-12-26 13:25  28.12.21   block
75     2022-01-31 15:00  14.02.22   count
75     2022-02-02 15:30  14.02.22   block
75     2022-02-03 15:00  14.02.22   block
75     2022-02-17 15:00  03.03.22   count
75     2022-02-17 15:00  03.03.22   block
75     2022-02-18 15:00  03.03.22   block
75     2022-02-23 15:00  03.03.22   block
75     2022-02-25 15:00  03.03.22   block
75     2022-03-04 10:46  18.03.22   count
75     2022-03-08 21:05  18.03.22   block
In Excel I simply add two columns. In one column I carry over a "blockdate", the date until which events have to be blocked. In the other column I compare the ID with the previous ID and the previous "blockdate".
When the IDs are different or the blockdate is less than the current date, I have to count. When I have to count, I set the row's blockdate to the current date + 14 days; otherwise I carry over the previous blockdate.
I have now tried to solve this in Postgres with ...
window functions
recursive CTEs
lateral joins
... and all seemed a bit promising, but in the end I failed to implement this tricky count.
For example, my recursive CTE failed with:
aggregate functions are not allowed in WHERE
with recursive event_count AS (
select event
, min(created_at) as created
from test
group by event
union all
( select event
, created_at as created
from test
join event_count
using(event)
where created_at >= max(created) + INTERVAL '14 days'
order by created_at
limit 1
)
)
select * from event_count
Window functions using lag() to access the previous row don't seem to work either, because they cannot access columns in the previous row that were themselves created by the window function.
Adding a "block-or-count" information upon entering a new event entry by simply comparing with the last entry wouldn't solve the issue as event entries "go away" after about half a year. So when the first entry goes away, the next one becomes the first and the logic has to be applied on the new situation.
Above test data can be created with:
CREATE TABLE test (
event INTEGER,
created_at TIMESTAMP
);
INSERT INTO test (event, created_at) VALUES
(16, '2021-11-11 11:15'),(16, '2021-11-11 11:15'),(16, '2021-11-13 10:45'),(16, '2021-11-16 10:40'),
(16, '2021-11-23 11:15'),(16, '2021-11-23 11:15'),(16, '2021-12-10 13:00'),(16, '2021-12-15 13:25'),
(16, '2021-12-15 13:25'),(16, '2021-12-15 13:25'),(16, '2021-12-20 13:15'),(16, '2021-12-23 13:15'),
(16, '2021-12-31 13:25'),(16, '2022-02-05 15:00'),(16, '2022-02-05 15:00'),(16, '2022-02-13 17:15'),
(16, '2022-02-21 10:09'),
(43, '2021-11-26 11:00'),(43, '2022-01-01 15:00'),(43, '2022-04-13 10:07'),(43, '2022-04-13 10:09'),
(43, '2022-04-13 10:09'),(43, '2022-04-13 10:09'),(43, '2022-04-13 10:10'),(43, '2022-04-13 10:10'),
(43, '2022-04-13 10:47'),(43, '2022-05-11 20:25'),
(75, '2021-10-21 12:50'),(75, '2021-11-02 12:50'),(75, '2021-11-18 11:15'),(75, '2021-11-18 12:55'),
(75, '2021-11-18 16:35'),(75, '2021-11-24 11:00'),(75, '2021-12-01 11:00'),(75, '2021-12-14 13:25'),
(75, '2021-12-15 13:35'),(75, '2021-12-26 13:25'),(75, '2022-01-31 15:00'),(75, '2022-02-02 15:30'),
(75, '2022-02-03 15:00'),(75, '2022-02-17 15:00'),(75, '2022-02-17 15:00'),(75, '2022-02-18 15:00'),
(75, '2022-02-23 15:00'),(75, '2022-02-25 15:00'),(75, '2022-03-04 10:46'),(75, '2022-03-08 21:05');
This lends itself to a procedural solution, since it has to walk the whole history of existing rows for each event. But SQL can do it, too.
The best solution heavily depends on cardinalities, data distribution, and other circumstances.
Assuming unfavorable conditions:
Big table.
Unknown number and identity of relevant events (event IDs).
Many rows per event.
Some overlap the 14-day time frame, some don't.
Any number of duplicates possible.
You need an index like this one:
CREATE INDEX test_event_created_at_idx ON test (event, created_at);
Then the following query emulates an index-skip scan. If the table is vacuumed enough, it operates with index-only scans exclusively, in a single pass:
WITH RECURSIVE hit AS (
(
SELECT event, created_at
FROM test
ORDER BY event, created_at
LIMIT 1
)
UNION ALL
SELECT t.*
FROM hit h
CROSS JOIN LATERAL (
SELECT t.event, t.created_at
FROM test t
WHERE (t.event, t.created_at)
> (h.event, h.created_at + interval '14 days')
ORDER BY t.event, t.created_at
LIMIT 1
) t
)
SELECT count(*) AS hits FROM hit;
fiddle
I cannot stress enough how fast it's going to be. :)
It's a recursive CTE using a LATERAL subquery, all based on the magic of ROW value comparison (which not all major RDBMSs support properly).
Effectively, we make Postgres skip over the above index once and only take qualifying rows.
For detailed explanation, see:
SELECT DISTINCT is slower than expected on my table in PostgreSQL
Efficiently selecting distinct (a, b) from big table
Optimize GROUP BY query to retrieve latest row per user (chapter 1a)
Different approach?
Like you mention yourself, the unfortunate task definition forces you to re-compute all newer rows for events where old data changes.
Consider working with a constant raster instead. Like a 14-day grid starting from Jan 1 every year. Then the state of each event could be derived from the local frame. Much cheaper and more reliable.
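For illustration, a minimal sketch of that raster idea (assuming 14-day frames anchored at an arbitrary date, here 2021-01-01, and the test table above); it only buckets rows into fixed frames and does not reproduce the original carry-forward rule:
-- date - date yields days in Postgres; integer division gives the frame number
SELECT event,
       date '2021-01-01' + 14 * ((created_at::date - date '2021-01-01') / 14) AS frame_start,
       count(*) AS rows_in_frame
FROM   test
GROUP  BY 1, 2
ORDER  BY 1, 2;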
I cannot think of how to do this without recursion.
with recursive ordered as ( -- Order and number the event instances
select event, created_at,
row_number() over (partition by event
order by created_at) as n
from test
), walk as (
-- Get and keep first instances
select event, created_at, n, created_at as current_base, true as keep
from ordered
where n = 1
union all
-- Carry base dates forward and mark records to keep
select c.event, c.created_at, c.n,
case
when c.created_at >= p.current_base + interval '14 days'
then c.created_at
else p.current_base
end as current_base,
(c.created_at >= p.current_base + interval '14 days') as keep
from walk p
join ordered c
on (c.event, c.n) = (p.event, p.n + 1)
)
select *
from walk
order by event, n;
Fiddle Here
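If only the resulting counts per event are needed, the final SELECT of the query above can be narrowed to the kept rows, for example:
-- replaces the final "select * from walk order by event, n"
select event, count(*) as counted
from walk
where keep
group by event
order by event;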

Extracting minutes between two timestamps and assigning different weights

I'm racking my brain over how to achieve this in Teradata.
I have two tables, and I need to extract minutes from the Run table and assign hourly weights to them based on the Weights table.
Table 1: Run
Machine Begin End
A 1/1/2010 08:00 AM 1/1/2010 10:45 AM
B 1/2/2010 10:00 AM 1/2/2010 11:45 AM
Table 2: Weights
Weights are assigned for every hour (record 1 says the weight is 10 for every run minute between 8 am and 9 am).
Hour Weight
1/1/2010 08:00 AM 10
1/1/2010 09:00 AM 15
1/1/2010 10:00 AM 16
1/1/2010 11:00 AM 20
1/1/2010 11:00 AM 20
1/1/2010 12:00 AM 25
Needed Result:
Mach Hour Weight Mins Total (Weight*Mins)
A 1/1/2010 08:00 AM 10 60 600
A 1/1/2010 09:00 AM 15 60 900
A 1/1/2010 10:00 AM 16 45 720
B 1/2/2010 10:00 AM 16 60 960
B 1/2/2010 11:00 AM 20 45 900
Any guidance appreciated. Thanks in advance.
Edit: Here are the sample tables
CREATE TABLE RUNS(NAME VARCHAR(50),START_DT timestamp(0),END_dt timestamp(0));
INSERT INTO RUNS VALUES ('A','2020-01-01 08:00:00','2020-01-01 10:15:00');
INSERT INTO RUNS VALUES ('B','2020-01-02 10:00:00','2020-01-02 11:45:00');
CREATE TABLE WEIGHTS(HOUR_MS timestamp(0),WEIGHT INTEGER);
INSERT INTO WEIGHTS VALUES ('2020-01-01 08:00:00', 10);
INSERT INTO WEIGHTS VALUES ('2020-01-01 09:00:00', 15);
INSERT INTO WEIGHTS VALUES ('2020-01-01 10:00:00', 16);
INSERT INTO WEIGHTS VALUES ('2020-01-01 11:00:00', 20);
INSERT INTO WEIGHTS VALUES ('2020-01-02 10:00:00', 20);
INSERT INTO WEIGHTS VALUES ('2020-01-02 11:00:00', 25);
This is a brute force approach using a non-equi-join based on OVERLAPS:
select
machine
,weight
-- get the number of minutes within the hour
,cast((interval(period(begin, end) p_intersect period(hour, hour + interval '1' hour)) minute(4)) as int) as mins
,mins * weight
from run join weights
on period(begin, end) overlaps period(hour, hour + interval '1' hour)
Explain will show a Product Join, which results in high CPU usage.
There's a smarter approach using EXPAND ON, but it's too late for me, maybe tomorrow :-)
An alternate approach using EXPAND ON in a subquery, followed by equality join:
SELECT machine
,TheHour
,weight
,CAST((INTERVAL(pd P_INTERSECT xpd) MINUTE(4)) AS INTEGER) mins
,mins*weight
FROM (
SELECT machine, PERIOD(begin, end) AS pd, xpd, BEGIN(xpd) AS begin_xpd
FROM run
EXPAND ON pd AS xpd
BY ANCHOR PERIOD ANCHOR_HOUR
) x
JOIN weights
ON begin_xpd = TheHour;
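If a per-machine total is ultimately needed, the hourly rows from either query above can be summed up; for example (a sketch reusing the column names from the first query, i.e. the ones from the question text):
SELECT machine, SUM(mins * weight) AS total_weighted_mins
FROM (
    SELECT machine, weight,
           CAST((INTERVAL(PERIOD(begin, end) P_INTERSECT PERIOD(hour, hour + INTERVAL '1' HOUR)) MINUTE(4)) AS INTEGER) AS mins
    FROM run JOIN weights
      ON PERIOD(begin, end) OVERLAPS PERIOD(hour, hour + INTERVAL '1' HOUR)
) AS hourly
GROUP BY machine;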

How to retrieve the min and max times of a timestamp column based on a time interval of 30 mins?

I am trying to pull a desired output that looks like this
Driver_ID| Interval_Start_Time | Interval_End_Time | Clocked_In_Time | Clocked_Out_Time | Score
232 | 2019-04-02 00:00:00.000 | 2019-04-02 00:30:00.000 | 2019-04-02 00:10:00.000 | 2019-04-02 00:29:00.000 | 0.55
My goal is to pull each ID in 30-minute (half-hour) intervals, along with the earliest (min) clocked-in time and the latest (max) clocked-out time within that same half-hour interval.
The query I have currently is
WITH TIME AS (
    SELECT DISTINCT
        CASE WHEN extract(MINUTE FROM offer_time_utc) < 30
             THEN date_trunc('hour', offer_time_utc)
             ELSE date_add('minute', 30, date_trunc('hour', offer_time_utc))
        END AS interval_start_time,
        CASE WHEN extract(MINUTE FROM offer_time_utc) < 30
             THEN date_add('minute', 30, date_trunc('hour', offer_time_utc))
             ELSE date_add('hour', 1, date_trunc('hour', offer_time_utc))
        END AS interval_end_time
    FROM integrated_delivery.trip_offer_fact offer
    WHERE offer.business_day = date '2019-04-01'
)
SELECT DISTINCT
    offer.Driver_ID,
    offer.region_uuid,
    interval_start_time,
    interval_end_time,
    min(sched.clocked_in_time_utc) AS clocked_in_time,
    max(sched.clocked_out_time_utc) AS clocked_out_time,
    cast(scores.acceptance_rate AS decimal(5,3)) AS acceptance_rate
FROM integrated_delivery.trip_offer_fact offer
JOIN TIME ON offer.offer_time_utc BETWEEN time.interval_start_time AND time.interval_end_time
JOIN integrated_delivery.courier_actual_hours_fact sched ON offer.Driver_ID = sched.Driver_ID
JOIN integrated_product.driver_score_v2 scores ON offer.Driver_ID = scores.courier_id
    AND offer.region_uuid = scores.region_id
    AND offer.region_uuid = sched.region_uuid
    AND offer.business_day = date '2019-04-01'
    AND sched.business_day = date '2019-04-01'
    AND scores.extract_dt = 20190331
    AND offer.region_uuid IN ('930c534f-a6b6-4bc1-b26e-de5de8930cf9')
GROUP BY 1, 2, 3, 4, 7
But it does not seem to give me the correct min and max clocked-in and clocked-out times for the corresponding interval, as shown below:
driver_uuid region_uuid interval_start_time interval_end_time clocked_in_time clocked_out_time score
232 bbv 2019-04-01 14:30:00.000 2019-04-01 15:00:00.000 2019-04-01 14:43:13.140 2019-04-01 22:30:46.043 0.173
When I add in these 2 lines,
JOIN TIME ON sched.clocked_in_time_utc BETWEEN time.interval_start_time AND time.interval_end_time
JOIN TIME ON sched.clocked_out_time_utc BETWEEN time.interval_start_time AND time.interval_end_time
it gives me an error, and I don't think that is correct anyway.
How can I correctly pull the min and max clock-in and clock-out times for the correct interval? That is, I only want the earliest clocked-in time and the latest clocked-out time within each half-hour interval.
I appreciate anybody looking! Thanks
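For what it's worth, one possible direction (a sketch, not a tested answer, and assuming the intent is that only clock-in/clock-out times falling inside an interval should be considered) is to make the aggregates conditional instead of adding extra joins, i.e. replace the plain min/max in the SELECT list above with:
min(CASE WHEN sched.clocked_in_time_utc >= time.interval_start_time
          AND sched.clocked_in_time_utc <  time.interval_end_time
         THEN sched.clocked_in_time_utc END) AS clocked_in_time,
max(CASE WHEN sched.clocked_out_time_utc >= time.interval_start_time
          AND sched.clocked_out_time_utc <  time.interval_end_time
         THEN sched.clocked_out_time_utc END) AS clocked_out_time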

BigQuery and Standard SQL: how to group by arbitrary day interval

I'm a BigQuery and SQL newbie that's continuing to tackle grouping problems. Using Standard SQL in BigQuery, I'd like to group data by X days. Here's a table of data:
event_id | url | timestamp
-----------------------------------------------------------
xx a.html 2016-10-18 15:55:16 UTC
xx a.html 2016-10-19 16:68:55 UTC
xx a.html 2016-10-25 20:55:57 UTC
yy b.html 2016-10-18 15:58:09 UTC
yy b.html 2016-10-18 08:32:43 UTC
zz a.html 2016-10-20 04:44:22 UTC
zz c.html 2016-10-21 02:12:34 UTC
I want to count the number of each event that occurred on each url in intervals of X days, starting from a given date. For example: how could I group this in intervals of 3 days, where my first interval starts on 2016-10-18 00:00:00 UTC? In addition, can I assign the 3rd day of the interval to each row? Example output:
event_id | url | count | 3dayIntervalLabel
-----------------------------------------------------------
xx a.html 2 2016-10-20 --> [18th thru 20th]
yy b.html 2 2016-10-20
zz a.html 1 2016-10-20
zz c.html 1 2016-10-23 --> [21st thru 23rd]
xx a.html 1 2016-10-26 --> [24th thru 26th]
I added three annotations to clarify the 3dayIntervalLabel values.
In general, I'm hoping to solve: group by intervals of X days, starting from date Y, and label the intervals using the final date of each interval.
Please let me know if more clarification is needed.
If you're interested, I've also asked similar questions on StackOverflow (and gotten answers) about grouping this data using a rolling window: initial question and follow-up.
Thanks!
WITH dailyAggregations AS (
SELECT
DATE(ts) AS day,
url,
event_id,
UNIX_SECONDS(TIMESTAMP(DATE(ts))) AS sec,
COUNT(1) AS events
FROM yourTable
GROUP BY day, url, event_id, sec
),
calendar AS (
SELECT day, DATE_ADD(day, INTERVAL 2 DAY) AS endday
FROM UNNEST (GENERATE_DATE_ARRAY('2016-10-18', '2016-11-06', INTERVAL 3 DAY)) AS day
)
SELECT
event_id,
url,
SUM(events) AS `count`,
c.endday AS `ThreedayIntervalLabel`
FROM calendar AS c
JOIN dailyAggregations AS a
ON a.day BETWEEN c.day AND c.endday
GROUP BY endday, url, event_id
If you have a base date, then something like this:
select floor(date_diff(date(timestamp), date '2016-10-18', day) / 3) as days,
count(*)
from t
group by days
order by days;
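A variation on that idea, labelling each 3-day bucket with its final date (a sketch assuming the same base date of 2016-10-18 and a table t(event_id, url, timestamp)):
SELECT
  event_id,
  url,
  COUNT(*) AS `count`,
  DATE_ADD(DATE '2016-10-18',
           INTERVAL DIV(DATE_DIFF(DATE(timestamp), DATE '2016-10-18', DAY), 3) * 3 + 2 DAY) AS three_day_interval_label
FROM t
GROUP BY event_id, url, three_day_interval_label;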

GROUP BY several hours

I have a table where our product records its activity log. The product starts working at 23:00 every day and usually runs for one or two hours. This means that a batch started at 23:00 finishes at about 1:00 am the next day.
Now I need statistics on how many posts are registered per batch, but I cannot figure out a script that would let me achieve this. So far I have the following SQL code:
SELECT COUNT(*), DATEPART(DAY,registrationtime),DATEPART(HOUR,registrationtime)
FROM RegistrationMessageLogEntry
WHERE registrationtime > '2014-09-01 20:00'
GROUP BY DATEPART(DAY, registrationtime), DATEPART(HOUR,registrationtime)
ORDER BY DATEPART(DAY, registrationtime), DATEPART(HOUR,registrationtime)
which results in following
count day hour
....
1189 9 23
8611 10 0
2754 10 23
6462 11 0
1885 11 23
That is, I want the count for the 9th at 23:00 grouped with the count for the 10th at 00:00, the 10th at 23:00 with the 11th at 00:00, and so on. How could I do it?
You can do it very easily: use DATEADD to add an hour to the original registrationtime. That way all registration times belonging to one batch fall on the same day, and you can simply group by the day part.
You could also do it in a more complicated way using CASE WHEN, but that's overkill in view of this easy solution.
I had to do something similar a few days ago. I had fixed timespans for work shifts to group by where one of them could start on one day at 10pm and end the next morning at 6am.
What I did was:
Define a "shift date", which was simply the day with zero timestamp when the shift started for every entry in the table. I was able to do so by checking whether the timestamp of the entry was between 0am and 6am. In that case I took only the date of this DATEADD(dd, -1, entryDate), which returned the previous day for all entries between 0am and 6am.
I also added an ID for the shift. 0 for the first one (6am to 2pm), 1 for the second one (2pm to 10pm) and 3 for the last one (10pm to 6am).
I was then able to group over the shift date and shift IDs.
Example:
Consider the following source entries:
Timestamp SomeData
=============================
2014-09-01 06:01:00 5
2014-09-01 14:01:00 6
2014-09-02 02:00:00 7
Step one extended the table as follows:
Timestamp SomeData ShiftDay
====================================================
2014-09-01 06:01:00 5 2014-09-01 00:00:00
2014-09-01 14:01:00 6 2014-09-01 00:00:00
2014-09-02 02:00:00 7 2014-09-01 00:00:00
Step two extended the table as follows:
Timestamp SomeData ShiftDay ShiftID
==============================================================
2014-09-01 06:01:00 5 2014-09-01 00:00:00 0
2014-09-01 14:01:00 6 2014-09-01 00:00:00 1
2014-09-02 02:00:00 7 2014-09-01 00:00:00 2
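A rough T-SQL sketch of that idea (assuming a hypothetical table Entries(EntryTime, SomeData) and shift boundaries at 06:00, 14:00 and 22:00 as described above):
SELECT
    CAST(DATEADD(HOUR, -6, EntryTime) AS date) AS ShiftDay,  -- entries before 06:00 fall back to the previous day
    DATEDIFF(HOUR, CAST(DATEADD(HOUR, -6, EntryTime) AS date), DATEADD(HOUR, -6, EntryTime)) / 8 AS ShiftID,  -- 0: 6am-2pm, 1: 2pm-10pm, 2: 10pm-6am
    COUNT(*) AS Entries
FROM Entries
GROUP BY
    CAST(DATEADD(HOUR, -6, EntryTime) AS date),
    DATEDIFF(HOUR, CAST(DATEADD(HOUR, -6, EntryTime) AS date), DATEADD(HOUR, -6, EntryTime)) / 8;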
If you add one hour to registrationtime, you will be able to group by the date part:
GROUP BY
CAST(DATEADD(HOUR, 1, registrationtime) AS date)
If the starting hour must be reflected accurately in the output (as 9, 23, 10, 23 rather than as 10, 0, 11, 0), you could obtain it as MIN(registrationtime) in the SELECT clause:
SELECT
count = COUNT(*),
day = DATEPART(DAY, MIN(registrationtime)),
hour = DATEPART(HOUR, MIN(registrationtime))
Finally, in case you are not aware, you can reference columns by their aliases in ORDER BY:
ORDER BY
day,
hour
just so that you do not have to repeat the expressions.
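Putting those fragments together against the table from the question, the whole query could look roughly like this:
SELECT
    count = COUNT(*),
    day   = DATEPART(DAY, MIN(registrationtime)),
    hour  = DATEPART(HOUR, MIN(registrationtime))
FROM RegistrationMessageLogEntry
WHERE registrationtime > '2014-09-01 20:00'
GROUP BY CAST(DATEADD(HOUR, 1, registrationtime) AS date)
ORDER BY day, hour;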
The query below will give you what you are expecting:
;WITH CTE AS
(
SELECT COUNT(*) Count, DATEPART(DAY,registrationtime) Day,DATEPART(HOUR,registrationtime) Hour,
RANK() over (partition by DATEPART(HOUR,registrationtime) order by DATEPART(DAY,registrationtime),DATEPART(HOUR,registrationtime)) Batch_ID
FROM RegistrationMessageLogEntry
WHERE registrationtime > '2014-09-01 20:00'
GROUP BY DATEPART(DAY, registrationtime), DATEPART(HOUR,registrationtime)
)
SELECT SUM(COUNT) Count,Batch_ID
FROM CTE
GROUP BY Batch_ID
ORDER BY Batch_ID
You can write CASE expressions for the grouping key as below (shifting hour 23 to the next day so it groups with the following hours):
CASE WHEN DATEPART(HOUR, registrationtime) = 23
     THEN DATEPART(DAY, registrationtime) + 1
     ELSE DATEPART(DAY, registrationtime)
END,
CASE WHEN DATEPART(HOUR, registrationtime) = 23
     THEN 0
     ELSE DATEPART(HOUR, registrationtime)
END
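For example (a sketch; note that DAY + 1 wraps incorrectly at month ends, so this only holds within a single month):
SELECT COUNT(*) AS cnt,
       CASE WHEN DATEPART(HOUR, registrationtime) = 23
            THEN DATEPART(DAY, registrationtime) + 1
            ELSE DATEPART(DAY, registrationtime)
       END AS batch_day
FROM RegistrationMessageLogEntry
WHERE registrationtime > '2014-09-01 20:00'
GROUP BY CASE WHEN DATEPART(HOUR, registrationtime) = 23
              THEN DATEPART(DAY, registrationtime) + 1
              ELSE DATEPART(DAY, registrationtime)
         END;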