Get rows corresponding to the same time ranges - sql

Suppose I have two timestamps forming a range like
BETWEEN TIMESTAMP('2022-12-08 01:00:00 UTC') AND TIMESTAMP('2022-12-08 02:00:00 UTC')
I want to get all rows that fall into these time ranges by some timestamp column. The TIME function in BigQuery helps achieve this, but it falls short when the time range crosses midnight (00:00:00). That is:
TIME(ts_col)
BETWEEN TIME(TIMESTAMP('2022-12-08 01:00:00 UTC'))
AND TIME(TIMESTAMP('2022-12-08 02:00:00 UTC'))
works, but the following will not:
TIME(ts_col)
BETWEEN TIME(TIMESTAMP('2022-12-07 23:00:00 UTC'))
AND TIME(TIMESTAMP('2022-12-08 01:00:00 UTC'))
e.g.
SELECT
TIME(TIMESTAMP('2022-12-08 00:00:00 UTC'))
BETWEEN TIME(TIMESTAMP('2022-12-07 23:00:00 UTC'))
AND TIME(TIMESTAMP('2022-12-08 01:00:00 UTC'))
returns false. Any ideas? I can guarantee that this range does not exceed 24 hours.

The TIME function extracts only the time part (hour, minute, second) from a timestamp. So when used in your example, the range in BETWEEN is invalid because the first time (23:00:00) is greater than the second (01:00:00).
Removing the TIME function will work as expected:
SELECT
TIMESTAMP('2022-12-08 00:00:00 UTC')
BETWEEN TIMESTAMP('2022-12-07 23:00:00 UTC')
AND TIMESTAMP('2022-12-08 01:00:00 UTC')
Output:
true
More info: https://cloud.google.com/bigquery/docs/reference/standard-sql/time_functions#time
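If you do need to compare by time of day across many days (the reason for reaching for TIME in the first place), one workaround is to split a midnight-wrapping range into two comparisons. This is a sketch, not part of the original answer; `project.dataset.table` and ts_col are placeholders:
-- Sketch: a time-of-day range that wraps past midnight (23:00 -> 01:00)
-- cannot use BETWEEN; instead, OR together the two sides of midnight.
SELECT *
FROM `project.dataset.table`
WHERE TIME(ts_col) >= TIME '23:00:00'
   OR TIME(ts_col) < TIME '01:00:00'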

Related

Google Bigquery - Create time series of number of active records

I'm trying to create a timeseries in google bigquery SQL. My data is a series of time ranges covering the period of activity for that record. Here is an example:
Start End
2020-11-01 21:04:00 UTC 2020-11-02 07:15:00 UTC
2020-11-01 21:45:00 UTC 2020-11-02 04:00:00 UTC
2020-11-01 22:00:00 UTC 2020-11-02 09:48:00 UTC
2020-11-01 22:00:00 UTC 2020-11-02 06:00:00 UTC
I wish to create a new table that totals the number of active records within each 15-minute block. "21:00:00", for example, would cover 21:00:00 to 21:14:59. My desired output for the above would be:
Period Active_Records
2020-11-01 21:00:00 1
2020-11-01 21:15:00 1
2020-11-01 21:30:00 1
2020-11-01 21:45:00 2
2020-11-01 22:00:00 4
2020-11-01 22:15:00 4
etc until the end of the last active range.
I would also like to be able to generate this on the fly by querying a date range and having it return every 15 minute block in the range and how many active records there was in that period.
Any assistance would be greatly appreciated.
Below is for BigQuery Standard SQL
#standardSQL
select ts as period, count(1) as Active_Records
from unnest((
  -- one timestamp per 15-minute block, spanning the data's full extent
  select generate_timestamp_array(timestamp_trunc(min(start), hour), max(`end`), interval 15 minute)
  from `project.dataset.table`
)) ts
join `project.dataset.table`
  -- a record is active in a block unless it ends before the block starts
  -- or starts after the block ends (block end = ts + 14:59)
  on not (`end` < ts or start > timestamp_add(ts, interval 15 * 60 - 1 second))
group by ts
Applied to the sample data from your question, this produces the expected output shown above.
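To also generate this on the fly for an arbitrary date range, one sketch (the DECLAREd variable names are mine, not from the answer) is to feed the range straight into generate_timestamp_array:
-- Sketch: same idea, parameterized by a queried date range.
DECLARE range_start TIMESTAMP DEFAULT TIMESTAMP '2020-11-01 21:00:00 UTC';
DECLARE range_end TIMESTAMP DEFAULT TIMESTAMP '2020-11-02 10:00:00 UTC';
select ts as period, count(1) as Active_Records
from unnest(generate_timestamp_array(range_start, range_end, interval 15 minute)) ts
join `project.dataset.table`
  on not (`end` < ts or start > timestamp_add(ts, interval 15 * 60 - 1 second))
group by ts
order by ts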

Pandas Difference from two DateTime64[ns] columns - Shows "-1 days" when goes through midnight

The goal is to compute the delta between two times, each in a separate DataFrame column in 24-hour clock format, and add it to a new column "triptime".
Here is my input, which has no dates, just 24-hour clock strings.
df = pd.DataFrame({'DepartureTime': ['2330', '1700', '0900'], 'ArrivalTime': ['0030','1900','1100']})
Here is my attempt
df['DepartureTime'] = pd.to_datetime(df.DepartureTime, format='%H%M')
df['ArrivalTime'] = pd.to_datetime(df.ArrivalTime, format='%H%M')
df['triptime'] = df.ArrivalTime - df.DepartureTime
This outputs a problem, as can be seen in the first row below: the difference comes out as "-1 days +01:00:00" because the arrival crosses midnight. Unfortunately my pipeline data assumes no change in dates. Any guidance on how I can have the triptime column show the actual trip time, without a prefix of days?
IIUC you can add astype() to return only the difference in hours.
df['triptime'] = (df.ArrivalTime - df.DepartureTime).astype('timedelta64[h]')
#output
DepartureTime ArrivalTime triptime
0 1900-01-01 23:30:00 1900-01-01 00:30:00 -23.0
1 1900-01-01 17:00:00 1900-01-01 19:00:00 2.0
2 1900-01-01 09:00:00 1900-01-01 11:00:00 2.0
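Note that on recent pandas versions (2.x), astype('timedelta64[h]') is no longer accepted; dividing by a one-hour Timedelta is an equivalent, version-safe way to get hours (a sketch of the same idea):
# Same result on pandas 2.x, where .astype('timedelta64[h]') raises;
# use // pd.Timedelta(hours=1) instead if you want whole hours only.
df['triptime'] = (df.ArrivalTime - df.DepartureTime) / pd.Timedelta(hours=1)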
One way to get the interval when the trip crosses midnight is to select all values less than zero and add 24. This apparently solves the problem, but it is not something I like; it seems highly susceptible to errors.
df.loc[df['triptime'] < 0, 'triptime'] = df['triptime'] + 24
#output
DepartureTime ArrivalTime triptime
0 1900-01-01 23:30:00 1900-01-01 00:30:00 1.0
1 1900-01-01 17:00:00 1900-01-01 19:00:00 2.0
2 1900-01-01 09:00:00 1900-01-01 11:00:00 2.0
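A more compact alternative (my sketch, not part of the original answer) is to take the raw difference modulo one day, which wraps negative values around midnight automatically:
# Sketch: -23 hours % 1 day == 1 hour, so midnight crossings come out right.
df['triptime'] = ((df.ArrivalTime - df.DepartureTime) % pd.Timedelta(days=1)) / pd.Timedelta(hours=1)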
The most correct and fail-safe way would be to have the full dates in addition to the departure and arrival times.
If, after the calculations, you want to remove the dates and keep only the times, use .dt.time:
df['DepartureTime'] = df['DepartureTime'].dt.time
df['ArrivalTime'] = df['ArrivalTime'].dt.time
#output
DepartureTime ArrivalTime triptime
0 23:30:00 00:30:00 1.0
1 17:00:00 19:00:00 2.0
2 09:00:00 11:00:00 2.0

Using grouper to group a timestamp in a specific range

Suppose that I have a DataFrame (DF). The index of this DataFrame is a timestamp running from 11 AM to 6 PM every day, and the DataFrame covers 30 days. I want to group it every 30 minutes. This is the function I'm using:
out = DF.groupby(pd.Grouper(freq='30min'))
The start date of the output is correct, but it considers the whole day (24 h) for grouping. For example, in the new timestamp index I have something like this:
11:00:00
11:30:00
12:00:00
12:30:00
...
18:00:00
18:30:00
...
23:00:00
23:30:00
...
2:00:00
2:30:00
...
...
10:30:00
11:00:00
11:30:00
As a result, many groups are empty, because from 6 PM to 11 AM I don't have any data.
One possible solution is DatetimeIndex.floor, which only creates groups for timestamps that actually occur in the index:
out = DF.groupby(DF.index.floor('30min'))
Or use dropna after aggregate function:
out = DF.groupby(pd.Grouper(freq='30min')).mean().dropna()
As mentioned in a comment on the original post, this is expected behavior. If you want to remove the empty groups, simply slice them away afterwards. Assuming in this case you are using count to aggregate:
df = df.groupby(pd.Grouper(freq='30min')).count()
df = df[(df > 0).any(axis=1)]  # keep only the 30-minute bins that contain data
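A minimal runnable sketch contrasting the two approaches (the index and column here are made up for illustration):
import numpy as np
import pandas as pd

# Two days of data between 11:00 and 18:00 only, sampled every 15 minutes.
idx = pd.date_range('2023-01-02 11:00', '2023-01-02 18:00', freq='15min')
idx = idx.append(pd.date_range('2023-01-03 11:00', '2023-01-03 18:00', freq='15min'))
DF = pd.DataFrame({'value': np.arange(len(idx))}, index=idx)

# Grouper emits every 30-minute bin across the full span, empty nights included...
grouper_out = DF.groupby(pd.Grouper(freq='30min')).count()
# ...while floor only emits bins where data actually exists.
floor_out = DF.groupby(DF.index.floor('30min')).count()
print(len(grouper_out), len(floor_out))  # Grouper yields many extra, empty bins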

Series Generating Functions and intervals?

Section 9.24, "Set Returning Functions", of the PostgreSQL 9.5 manual has an example with generate_series that I disagree with.
SELECT * FROM generate_series('2008-03-01 00:00'::timestamp,
'2008-03-04 12:00', '10 hours');
generate_series
---------------------
2008-03-01 00:00:00
2008-03-01 10:00:00
2008-03-01 20:00:00
2008-03-02 06:00:00
2008-03-02 16:00:00
2008-03-03 02:00:00
2008-03-03 12:00:00
2008-03-03 22:00:00
2008-03-04 08:00:00
(9 rows)
Let me explain why: if we are talking about intervals 10 hours in length, then the row "2008-03-04 08:00:00" should not be displayed, since only 4 hours remain before the end of the range.
The last row should be 2008-03-03 22:00:00.
I recently ran into this problem of outputting only complete intervals. How can it be solved?
No, you are interpreting it wrong.
generate_series() starts with the first value and continues adding the interval until the value would exceed the end value.
That is exactly how it is defined and exactly what your example is doing. The boundaries are the start and end.
A simpler example uses numbers. This query:
SELECT *
FROM generate_series(0, 22, 5);
returns a series of multiples of 5 up to 22. That is, the last value is 20.
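If what you actually want is only the intervals that fit entirely inside the range, one way (my sketch, not from the original answer) is to shorten the upper bound by one interval:
-- Sketch: emit only the starts of *complete* 10-hour intervals by
-- stopping one interval short of the end of the range.
SELECT *
FROM generate_series('2008-03-01 00:00'::timestamp,
                     '2008-03-04 12:00'::timestamp - interval '10 hours',
                     '10 hours');
-- The last row is now 2008-03-03 22:00:00, because 2008-03-04 08:00
-- plus 10 hours would overshoot 2008-03-04 12:00.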

Reset the date portion to the last of the month while preserving the time

Is there any way to reset the date portion to the last day of the month while preserving the time? For example:
2018-01-02 23:00:00 -> 2018-01-31 23:00:00
2018-04-04 10:00:00 -> 2018-04-30 10:00:00
The Oracle function last_day() does exactly this. Try:
select last_day(sysdate), sysdate
from dual
to see how it works.
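For illustration with a hard-coded date instead of sysdate (a sketch; note that the time of day survives):
-- last_day() keeps the time portion intact:
select last_day(to_date('2018-04-04 10:00:00', 'YYYY-MM-DD HH24:MI:SS'))
from dual;
-- returns 2018-04-30 10:00:00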
Ironically, I usually find the preservation of the time to be counterintuitive, so my usual usage is more like:
select last_day(trunc(sysdate))
from dual