BigQuery and Standard SQL: how to group by arbitrary day interval - sql

I'm a BigQuery and SQL newbie that's continuing to tackle grouping problems. Using Standard SQL in BigQuery, I'd like to group data by X days. Here's a table of data:
event_id | url | timestamp
-----------------------------------------------------------
xx a.html 2016-10-18 15:55:16 UTC
xx a.html 2016-10-19 16:08:55 UTC
xx a.html 2016-10-25 20:55:57 UTC
yy b.html 2016-10-18 15:58:09 UTC
yy b.html 2016-10-18 08:32:43 UTC
zz a.html 2016-10-20 04:44:22 UTC
zz c.html 2016-10-21 02:12:34 UTC
I want to count the number of each event that occurred on each url in intervals of X days, starting from a given date. For example: how could I group this in intervals of 3 days, where my first interval starts on 2016-10-18 00:00:00 UTC? In addition, can I assign the 3rd day of the interval to each row? Example output:
event_id | url | count | 3dayIntervalLabel
-----------------------------------------------------------
xx a.html 2 2016-10-20 --> [18th thru 20th]
yy b.html 2 2016-10-20
zz a.html 1 2016-10-20
zz c.html 1 2016-10-23 --> [21st thru 23rd]
xx a.html 1 2016-10-26 --> [24th thru 26th]
I added three annotations to clarify the 3dayIntervalLabel values.
In general, I'm hoping to solve: group by intervals of X days, starting from date Y, and label each interval with its final date.
Please let me know if more clarification is needed.
If you're interested, I've also asked similar questions on StackOverflow (and gotten answers) about grouping this data using a rolling window: initial question and follow-up.
Thanks!

WITH dailyAggregations AS (
  SELECT
    DATE(ts) AS day,
    url,
    event_id,
    COUNT(1) AS events
  FROM yourTable
  GROUP BY day, url, event_id
),
calendar AS (
  SELECT day, DATE_ADD(day, INTERVAL 2 DAY) AS endday
  FROM UNNEST(GENERATE_DATE_ARRAY('2016-10-18', '2016-11-06', INTERVAL 3 DAY)) AS day
)
SELECT
  event_id,
  url,
  SUM(events) AS `count`,
  c.endday AS `ThreedayIntervalLabel`
FROM calendar AS c
JOIN dailyAggregations AS a
  ON a.day BETWEEN c.day AND c.endday
GROUP BY endday, url, event_id
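For readers who want to trace the calendar logic outside SQL, here is a small Python sketch of the same idea: generate non-overlapping (start, end) date pairs, then assign each day to the range containing it, exactly like the `BETWEEN` join above. The dates mirror the query; this is an illustration of the logic, not a translation of BigQuery semantics.

```python
from datetime import date, timedelta

def generate_calendar(start, end, step_days):
    """Bucket (start_day, end_day) pairs, like GENERATE_DATE_ARRAY + DATE_ADD."""
    buckets = []
    d = start
    while d <= end:
        buckets.append((d, d + timedelta(days=step_days - 1)))
        d += timedelta(days=step_days)
    return buckets

calendar = generate_calendar(date(2016, 10, 18), date(2016, 11, 6), 3)
print(calendar[0])   # (date(2016, 10, 18), date(2016, 10, 20))

# assigning a day to its bucket = the BETWEEN join in the query
day = date(2016, 10, 25)
bucket = next((b for b in calendar if b[0] <= day <= b[1]), None)
print(bucket[1])     # date(2016, 10, 26) -- the interval label
```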

If you have a base date, then something like this:
select floor(date_diff(date(timestamp), date '2016-10-18', day) / 3) as days,
count(*)
from t
group by days
order by days;
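The integer-division trick generalizes nicely: the bucket index is `floor(days_since_base / width)`, and the end-of-interval label the question asks for is `base + (bucket + 1) * width - 1` days. A hedged Python sketch of that arithmetic, using the question's sample data:

```python
from datetime import date, timedelta
from collections import Counter

base, width = date(2016, 10, 18), 3

# (event_id, url, event_date) rows from the question's sample table
events = [
    ("xx", "a.html", date(2016, 10, 18)),
    ("xx", "a.html", date(2016, 10, 19)),
    ("xx", "a.html", date(2016, 10, 25)),
    ("yy", "b.html", date(2016, 10, 18)),
    ("yy", "b.html", date(2016, 10, 18)),
    ("zz", "a.html", date(2016, 10, 20)),
    ("zz", "c.html", date(2016, 10, 21)),
]

def interval_label(d):
    bucket = (d - base).days // width                       # floor(date_diff / 3)
    return base + timedelta(days=(bucket + 1) * width - 1)  # last day of the bucket

counts = Counter((e, u, interval_label(d)) for e, u, d in events)
```

Running this reproduces the expected output from the question, e.g. two `xx`/`a.html` events labeled 2016-10-20 and one labeled 2016-10-26.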

Related

The Average Number of Rides Completed in 4 Hours

I have a dataset with each ride having its own ride_id and its completion time. I want to know how many rides happen every 4 hours, on average.
Sample Dataset:
dropoff_datetime ride_id
2022-08-27 11:42:02 1715
2022-08-24 05:59:26 1713
2022-08-23 17:40:05 1716
2022-08-28 23:06:01 1715
2022-08-27 03:21:29 1714
For example: how many rides happened between 2022-08-27 12 PM and 2022-08-27 4 PM? And then how many happened in the next 4-hour period, from 4 PM to 8 PM?
What I've tried:
I first truncate my dropoff_datetime into the hour. (DATE_TRUNC)
I then group by that hour to get the count of rides per hour.
Example Query:
Note: I'm calling the above table final.
SELECT DATE_TRUNC('hour', dropoff_datetime) as by_hour
,count(ride_id) as total_rides
FROM final
WHERE 1=1
GROUP BY 1
Result:
by_hour total_rides
2022-08-27 4:00:00 3756
2022-08-27 5:00:00 6710
My question is:
How can I make it so it's grouping every 4 hours instead?
The question actually consists of two parts: how to generate the date ranges, and how to aggregate the data into them. One possible approach is to use the minimum and maximum dates in the data to generate the range, then join with the data again:
-- sample data
with dataset (dropoff_datetime, ride_id) AS
(VALUES (timestamp '2022-08-24 11:42:02', 1715),
(timestamp '2022-08-24 05:59:26', 1713),
(timestamp '2022-08-24 05:29:26', 1712),
(timestamp '2022-08-23 17:40:05', 1716)),
-- query part
min_max as (
select min(date_trunc('hour', dropoff_datetime)) d_min, max(date_trunc('hour', dropoff_datetime)) d_max
from dataset
),
date_ranges as (
select h
from min_max,
unnest (sequence(d_min, d_max, interval '4' hour)) t(h)
)
select h, count_if(ride_id is not null)
from date_ranges
left join dataset on dropoff_datetime >= h and dropoff_datetime < h + interval '4' hour
group by h
order by h;
This produces the following output (note the half-open range in the join, so a ride landing exactly on a boundary is counted only once):
h                   | _col1
---------------------------
2022-08-23 17:00:00 | 1
2022-08-23 21:00:00 | 0
2022-08-24 01:00:00 | 0
2022-08-24 05:00:00 | 2
2022-08-24 09:00:00 | 1
Note that this can be quite performance-intensive for large amounts of data.
Another approach is to pick a "reference point" and start counting from it. For example, using the minimum date in the dataset:
-- sample data
with dataset (dropoff_datetime, ride_id) AS
(VALUES (timestamp '2022-08-27 11:42:02', 1715),
(timestamp '2022-08-24 05:59:26', 1713),
(timestamp '2022-08-24 05:29:26', 1712),
(timestamp '2022-08-23 17:40:05', 1716),
(timestamp '2022-08-28 23:06:01', 1715),
(timestamp '2022-08-27 03:21:29', 1714)),
-- query part
base_with_curr AS (
select (select min(date_trunc('hour', dropoff_datetime)) from dataset) base,
date_trunc('hour', dropoff_datetime) dropoff_datetime
from dataset)
select date_add('hour', (date_diff('hour', base, dropoff_datetime) / 4)*4, base) as four_hour,
count(*)
from base_with_curr
group by 1;
Output:
four_hour           | _col1
---------------------------
2022-08-23 17:00:00 | 1
2022-08-28 21:00:00 | 1
2022-08-24 05:00:00 | 2
2022-08-27 09:00:00 | 1
2022-08-27 01:00:00 | 1
Then you can use the sequence approach above to generate the missing buckets if needed.
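The arithmetic behind this reference-point approach (truncate to the hour, count hours since the base, integer-divide by 4) is easy to check outside the database. A Python sketch using the same sample timestamps:

```python
from datetime import datetime, timedelta
from collections import Counter

dropoffs = [
    datetime(2022, 8, 27, 11, 42, 2),
    datetime(2022, 8, 24, 5, 59, 26),
    datetime(2022, 8, 24, 5, 29, 26),
    datetime(2022, 8, 23, 17, 40, 5),
    datetime(2022, 8, 28, 23, 6, 1),
    datetime(2022, 8, 27, 3, 21, 29),
]

# reference point: earliest dropoff truncated to the hour
base = min(d.replace(minute=0, second=0, microsecond=0) for d in dropoffs)

def four_hour_bucket(ts):
    hours = int((ts - base).total_seconds() // 3600)  # date_diff('hour', base, ts)
    return base + timedelta(hours=(hours // 4) * 4)   # date_add('hour', (diff/4)*4, base)

counts = Counter(four_hour_bucket(d) for d in dropoffs)
```

With this data, `base` is 2022-08-23 17:00:00 and the bucket counts match the answer's output table.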

Calculating difference (or deltas) between current and previous row with clickhouse

Is there a way to SELECT (compute) the difference of a single column between consecutive rows?
It would be awesome if there were a way to index rows during a query.
Let's say, something like the following query
SELECT
    toStartOfDay(stamp) AS day,
    count(day) AS events,
    day[current] - day[previous] AS difference, -- how do I calculate this
    day[current] / day[previous] AS percent,    -- and this
FROM records
GROUP BY day
ORDER BY day
I want to get the integer and percentage difference between the current row's 'events' column and the previous one for something similar to this:
day                 | events | difference | percent
---------------------------------------------------
2022-01-06 00:00:00 | 197    | NULL       | NULL
2022-01-07 00:00:00 | 656    | 459        | 3.32
2022-01-08 00:00:00 | 15     | -641       | 0.02
2022-01-09 00:00:00 | 7      | -8         | 0.46
2022-01-10 00:00:00 | 137    | 130        | 19.5
My version of ClickHouse doesn't support window functions, but after looking into the LAG() function mentioned in the comments, I found neighbor(), which works perfectly for what I'm trying to do:
SELECT
    toStartOfDay(stamp) AS day,
    count(day) AS events,
    (events - neighbor(events, -1)) AS diff,
    (events / neighbor(events, -1)) AS perc
FROM records
GROUP BY day
ORDER BY day
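`neighbor(events, -1)` simply reads the previous row of the result, so the same computation can be sketched in plain Python to sanity-check the numbers (rounding here may differ slightly from the truncated values shown in the sample table):

```python
# daily event counts in day order, like the grouped query's output
daily = [("2022-01-06", 197), ("2022-01-07", 656), ("2022-01-08", 15),
         ("2022-01-09", 7), ("2022-01-10", 137)]

result = []
prev = None  # no previous row for the first day -> NULL-like None
for day, events in daily:
    diff = None if prev is None else events - prev
    perc = None if prev is None else round(events / prev, 2)
    result.append((day, events, diff, perc))
    prev = events
```

One caveat worth knowing: in ClickHouse, `neighbor()` operates within a data block, so on large result sets the first row of each block sees a default value rather than the true previous row.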

How to bin timestamp data into buckets of custom width of n hours in vertica

I have a table with a column Start_Timestamp which holds timestamp values like 2020-06-02 21:08:37. I would like to create a new column which classifies these timestamps into bins of 6 hours.
Eg.
Input :
Start_Timestamp
2020-06-02 21:08:37
2020-07-19 01:23:40
2021-11-13 12:08:37
Expected Output (each bin is 6 hours wide):
Start_Timestamp     | Bin
-------------------------------
2020-06-02 21:08:37 | 18H - 24H
2020-07-19 01:23:40 | 00H - 06H
2021-11-13 12:08:37 | 12H - 18H
I have tried using TIMESERIES, but can anyone help me generate output in the above format?
It's Vertica. Use the TIME_SLICE() function. Then, combine it with the TO_CHAR() function that Vertica shares with Oracle.
You can always add a CASE WHEN expression to change 00:00 to 24:00, but as that is not the standard, I wouldn't even bother.
WITH
indata(start_ts) AS (
SELECT TIMESTAMP '2020-06-02 21:08:37'
UNION ALL SELECT TIMESTAMP '2020-07-19 01:23:40'
UNION ALL SELECT TIMESTAMP '2021-11-13 12:08:37'
)
SELECT
TIME_SLICE(start_ts,6,'HOUR')
AS tm_slice
, TO_CHAR(TIME_SLICE(start_ts,6,'HOUR'),'HH24:MIH - ')
||TO_CHAR(TIME_SLICE(start_ts,6,'HOUR','END'),'HH24:MIH')
AS caption
, start_ts
FROM indata;
-- out tm_slice | caption | start_ts
-- out ---------------------+-----------------+---------------------
-- out 2020-06-02 18:00:00 | 18:00H - 00:00H | 2020-06-02 21:08:37
-- out 2020-07-19 00:00:00 | 00:00H - 06:00H | 2020-07-19 01:23:40
-- out 2021-11-13 12:00:00 | 12:00H - 18:00H | 2021-11-13 12:08:37
You can simply extract the hour and do some arithmetic:
select t.*,
floor(extract(hour from start_timestamp) / 6) * 6 as bin
from t;
Note: This characterizes the bin by the earliest hour. That seems more useful than a string representation, but you can construct a string if you really want.
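Both answers reduce to the same arithmetic: `floor(hour / 6) * 6` gives the bin's starting hour, from which the string label can be formatted. A quick Python sketch of that label logic (the function name is mine, not Vertica's):

```python
from datetime import datetime

def six_hour_bin(ts):
    start = (ts.hour // 6) * 6          # floor(extract(hour) / 6) * 6
    return f"{start:02d}H - {start + 6:02d}H"

print(six_hour_bin(datetime(2020, 6, 2, 21, 8, 37)))   # 18H - 24H
```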

How to retrieve the min and max times of a timestamp column based on a time interval of 30 mins?

I am trying to pull a desired output that looks like this
Driver_ID| Interval_Start_Time | Interval_End_Time | Clocked_In_Time | Clocked_Out_Time | Score
232 | 2019-04-02 00:00:00.000 | 2019-04-02 00:30:00.000 | 2019-04-02 00:10:00.000 | 2019-04-02 00:29:00.000 | 0.55
My goal is to pull each ID in 30-minute (half-hour) intervals, along with the earliest clocked-in time and the latest clocked-out time within each interval.
The query I have currently is
WITH TIME AS (
  SELECT DISTINCT
    CASE
      WHEN extract(MINUTE FROM offer_time_utc) < 30 THEN date_trunc('hour', offer_time_utc)
      ELSE date_add('minute', 30, date_trunc('hour', offer_time_utc))
    END AS interval_start_time,
    CASE
      WHEN extract(MINUTE FROM offer_time_utc) < 30 THEN date_add('minute', 30, date_trunc('hour', offer_time_utc))
      ELSE date_add('hour', 1, date_trunc('hour', offer_time_utc))
    END AS interval_end_time
  FROM integrated_delivery.trip_offer_fact offer
  WHERE offer.business_day = date '2019-04-01'
)
SELECT DISTINCT
  offer.Driver_ID,
  offer.region_uuid,
  interval_start_time,
  interval_end_time,
  min(sched.clocked_in_time_utc) AS clocked_in_time,
  max(sched.clocked_out_time_utc) AS clocked_out_time,
  cast(scores.acceptance_rate AS decimal(5,3)) AS acceptance_rate
FROM integrated_delivery.trip_offer_fact offer
JOIN TIME ON offer.offer_time_utc BETWEEN time.interval_start_time AND time.interval_end_time
JOIN integrated_delivery.courier_actual_hours_fact sched ON offer.Driver_ID = sched.Driver_ID
JOIN integrated_product.driver_score_v2 scores ON offer.Driver_ID = scores.courier_id
  AND offer.region_uuid = scores.region_id
  AND offer.region_uuid = sched.region_uuid
  AND offer.business_day = date '2019-04-01'
  AND sched.business_day = date '2019-04-01'
  AND scores.extract_dt = 20190331
  AND offer.region_uuid IN ('930c534f-a6b6-4bc1-b26e-de5de8930cf9')
GROUP BY 1,2,3,4,7
But it does not seem to give me the correct min and max clocked-in and clocked-out times for each interval, as shown below:
driver_uuid region_uuid interval_start_time interval_end_time clocked_in_time clocked_out_time score
232 bbv 2019-04-01 14:30:00.000 2019-04-01 15:00:00.000 2019-04-01 14:43:13.140 2019-04-01 22:30:46.043 0.173
When I add in these 2 lines,
JOIN TIME ON sched.clocked_in_time_utc BETWEEN time.interval_start_time AND time.interval_end_time
JOIN TIME ON sched.clocked_out_time_utc BETWEEN time.interval_start_time AND time.interval_end_time
it gives me an error, so I don't think that is correct.
How can I correctly pull in the min and max clock in and clock out time for the correct interval? Meaning I only want the earliest clocked in time and the latest clocked out time in that per half hour interval start and end time.
I appreciate anybody looking ! Thanks
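One way to sanity-check the interval logic itself: the two CASE expressions in the query amount to truncating a timestamp down to the nearest half hour. A Python sketch of that assignment (function name is mine, not from the query):

```python
from datetime import datetime, timedelta

def half_hour_interval(ts):
    """Interval (start, end), mirroring the CASE on extract(MINUTE ...)."""
    start = ts.replace(minute=0 if ts.minute < 30 else 30, second=0, microsecond=0)
    return start, start + timedelta(minutes=30)

# 00:10 falls in [00:00, 00:30), as in the desired output row
print(half_hour_interval(datetime(2019, 4, 2, 0, 10)))
```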

GROUP BY several hours

I have a table where our product records its activity log. The product starts working at 23:00 every day and usually runs for one or two hours. This means that once a batch starts at 23:00, it finishes at about 1:00 am the next day.
Now I need to gather statistics on how many posts are registered per batch, but I cannot figure out a script that would allow me to achieve this. So far I have the following SQL code:
SELECT COUNT(*), DATEPART(DAY,registrationtime),DATEPART(HOUR,registrationtime)
FROM RegistrationMessageLogEntry
WHERE registrationtime > '2014-09-01 20:00'
GROUP BY DATEPART(DAY, registrationtime), DATEPART(HOUR,registrationtime)
ORDER BY DATEPART(DAY, registrationtime), DATEPART(HOUR,registrationtime)
which results in following
count day hour
....
1189 9 23
8611 10 0
2754 10 23
6462 11 0
1885 11 23
I.e. I want the number for 9th 23:00 grouped with the number for 10th 00:00, 10th 23:00 with 11th 00:00 and so on. How could I do it?
You can do it very easily. Use DATEADD to add an hour to the original registrationtime. If you do so, all the registration times will be moved onto the same day, and you can simply group by the day part.
You could also do it in a more complicated way using CASE WHEN, but that's overkill in view of this easy solution.
I had to do something similar a few days ago. I had fixed timespans for work shifts to group by where one of them could start on one day at 10pm and end the next morning at 6am.
What I did was:
Define a "shift date": the day (with a zero timestamp) on which the shift started, for every entry in the table. I determined it by checking whether the entry's timestamp was between 0am and 6am; in that case I took only the date part of DATEADD(dd, -1, entryDate), which returns the previous day for all entries between 0am and 6am.
I also added an ID for the shift: 0 for the first one (6am to 2pm), 1 for the second one (2pm to 10pm) and 2 for the last one (10pm to 6am).
I was then able to group over the shift date and shift IDs.
Example:
Consider the following source entries:
Timestamp SomeData
=============================
2014-09-01 06:01:00 5
2014-09-01 14:01:00 6
2014-09-02 02:00:00 7
Step one extended the table as follows:
Timestamp SomeData ShiftDay
====================================================
2014-09-01 06:01:00 5 2014-09-01 00:00:00
2014-09-01 14:01:00 6 2014-09-01 00:00:00
2014-09-02 02:00:00 7 2014-09-01 00:00:00
Step two extended the table as follows:
Timestamp SomeData ShiftDay ShiftID
==============================================================
2014-09-01 06:01:00 5 2014-09-01 00:00:00 0
2014-09-01 14:01:00 6 2014-09-01 00:00:00 1
2014-09-02 02:00:00 7 2014-09-01 00:00:00 2
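The shift-date-plus-shift-ID derivation above can be sketched in Python (shift boundaries as described; the function name is illustrative):

```python
from datetime import date, datetime, timedelta

def shift_day_and_id(ts):
    """Shift 0: 6am-2pm, shift 1: 2pm-10pm, shift 2: 10pm-6am (spills into next day)."""
    if ts.hour < 6:                           # belongs to the previous day's night shift
        return (ts - timedelta(days=1)).date(), 2
    day = ts.date()
    if ts.hour < 14:
        return day, 0
    if ts.hour < 22:
        return day, 1
    return day, 2

# 2014-09-02 02:00 belongs to the night shift that started on 2014-09-01
assert shift_day_and_id(datetime(2014, 9, 2, 2, 0)) == (date(2014, 9, 1), 2)
```

Grouping by the `(shift_day, shift_id)` pair then reproduces the table in the answer.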
If you add one hour to registrationtime, you will be able to group by the date part:
GROUP BY
CAST(DATEADD(HOUR, 1, registrationtime) AS date)
If the starting hour must be reflected accurately in the output (as 9, 23, 10, 23 rather than as 10, 0, 11, 0), you could obtain it as MIN(registrationtime) in the SELECT clause:
SELECT
count = COUNT(*),
day = DATEPART(DAY, MIN(registrationtime)),
hour = DATEPART(HOUR, MIN(registrationtime))
Finally, in case you are not aware, you can reference columns by their aliases in ORDER BY:
ORDER BY
day,
hour
just so that you do not have to repeat the expressions.
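The add-one-hour trick is easy to verify in miniature: shifting every timestamp forward one hour makes a 23:xx entry and the following day's 00:xx entries share a calendar date. A small Python sketch with made-up registration times:

```python
from datetime import datetime, timedelta
from collections import Counter

registrations = [
    datetime(2014, 9, 9, 23, 15),   # 9th 23:00 batch
    datetime(2014, 9, 10, 0, 30),   # same batch, after midnight
    datetime(2014, 9, 10, 23, 5),   # next batch
]

# adding one hour moves 23:00-00:59 entries onto a single calendar day
batches = Counter((t + timedelta(hours=1)).date() for t in registrations)
```

Here the first two registrations land on the same key (2014-09-10), i.e. one batch, and the third starts a new one.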
The query below will give you what you are expecting:
;WITH CTE AS
(
    SELECT COUNT(*) AS [Count], DATEPART(DAY, registrationtime) AS [Day], DATEPART(HOUR, registrationtime) AS [Hour],
           RANK() OVER (PARTITION BY DATEPART(HOUR, registrationtime) ORDER BY DATEPART(DAY, registrationtime), DATEPART(HOUR, registrationtime)) AS Batch_ID
    FROM RegistrationMessageLogEntry
    WHERE registrationtime > '2014-09-01 20:00'
    GROUP BY DATEPART(DAY, registrationtime), DATEPART(HOUR, registrationtime)
)
SELECT SUM([Count]) AS [Count], Batch_ID
FROM CTE
GROUP BY Batch_ID
ORDER BY Batch_ID
You can write CASE expressions as below, mapping each 23:00 entry to the next day's hour 0 before grouping:
CASE WHEN DATEPART(HOUR, registrationtime) = 23
     THEN DATEPART(DAY, registrationtime) + 1
     ELSE DATEPART(DAY, registrationtime)
END,
CASE WHEN DATEPART(HOUR, registrationtime) = 23
     THEN 0
     ELSE DATEPART(HOUR, registrationtime)
END