Series Generating Functions and intervals?

Section 9.24, "Set Returning Functions", of the PostgreSQL 9.5 manual contains an example using generate_series that I disagree with:
SELECT * FROM generate_series('2008-03-01 00:00'::timestamp,
'2008-03-04 12:00', '10 hours');
generate_series
---------------------
2008-03-01 00:00:00
2008-03-01 10:00:00
2008-03-01 20:00:00
2008-03-02 06:00:00
2008-03-02 16:00:00
2008-03-03 02:00:00
2008-03-03 12:00:00
2008-03-03 22:00:00
2008-03-04 08:00:00
(9 rows)
My reasoning: if we are talking about intervals 10 hours in length, then the row "2008-03-04 08:00:00" should not be displayed, since only 4 hours remain before the end of the range.
The last row should be 2008-03-03 22:00:00.
I recently ran into this problem of outputting only whole intervals. How can it be solved?

No, you are interpreting it wrong.
generate_series() starts with the first value and continues adding the interval until the value would exceed the end value.
That is exactly how it is defined and exactly what your example is doing. The boundaries are the start and end.
A simpler example uses numbers. This query:
SELECT *
FROM generate_series(0, 22, 5);
returns a series of multiples of 5 up to 22. That is, the last value is 20.
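If you want only steps that fit entirely within the range, one option (a sketch, not part of the original answer) is to pull the upper bound back by one interval, so every generated value still has a full 10 hours before the end:
SELECT * FROM generate_series('2008-03-01 00:00'::timestamp,
'2008-03-04 12:00'::timestamp - interval '10 hours',
'10 hours');
-- last row: 2008-03-03 22:00:00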

Related

Get rows corresponding to the same time ranges

Suppose I have two timestamps forming a range like
BETWEEN TIMESTAMP('2022-12-08 01:00:00 UTC') AND TIMESTAMP('2022-12-08 02:00:00 UTC')
I want to get all rows whose timestamp column falls into these time ranges. The TIME function in BigQuery helps achieve this, but it falls short when the time range crosses midnight (00:00:00). That is:
TIME(ts_col)
BETWEEN TIME(TIMESTAMP('2022-12-08 01:00:00 UTC'))
AND TIME(TIMESTAMP('2022-12-08 02:00:00 UTC'))
works, but the following will not:
TIME(ts_col)
BETWEEN TIME(TIMESTAMP('2022-12-07 23:00:00 UTC'))
AND TIME(TIMESTAMP('2022-12-08 01:00:00 UTC'))
e.g.
SELECT
TIME(TIMESTAMP('2022-12-08 00:00:00 UTC'))
BETWEEN TIME(TIMESTAMP('2022-12-07 23:00:00 UTC'))
AND TIME(TIMESTAMP('2022-12-08 01:00:00 UTC'))
returns false. Any ideas? I can guarantee that this range does not exceed 24 hours.
The TIME function extracts only the time part (hour, minute, second) from a timestamp. So when it is used in your example, the range in BETWEEN is invalid because the first time (23:00:00) is greater than the second (01:00:00).
Removing the TIME function works as expected:
SELECT
TIMESTAMP('2022-12-08 00:00:00 UTC')
BETWEEN TIMESTAMP('2022-12-07 23:00:00 UTC')
AND TIMESTAMP('2022-12-08 01:00:00 UTC')
Output: true
More info: https://cloud.google.com/bigquery/docs/reference/standard-sql/time_functions#time
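If you really do need a pure time-of-day filter that wraps past midnight, one workaround (a sketch, not part of the original answer; my_table is a hypothetical table name, ts_col is the question's timestamp column) is to split the predicate in two, since BETWEEN cannot express a wrap-around range:
SELECT *
FROM my_table
WHERE TIME(ts_col) >= TIME '23:00:00'
   OR TIME(ts_col) <= TIME '01:00:00'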

Interpolate hourly load for selected months of a year from the same months of the previous year and the next year in Python pandas?

I have the following three dataframes:
df1:
date_time system_load
01-01-2017 00:00:00 208111
01-01-2017 01:00:00 208311
01-01-2017 02:00:00 208311
01-01-2017 03:00:00 208011
............... ...
31-12-2017 20:00:00 208611
31-12-2017 21:00:00 208411
31-12-2017 22:00:00 208111
31-12-2017 23:00:00 208911
The system load values in df1 are complete.
df2:
date_time system_load
01-01-2018 00:00:00 208111
01-01-2018 01:00:00 208311
01-01-2018 02:00:00 208311
01-01-2018 03:00:00 208011
............... ...
31-12-2018 20:00:00 209611
31-12-2018 21:00:00 209411
31-12-2018 22:00:00 209111
31-12-2018 23:00:00 209911
The system load values in df2 are missing from 06-03-2018 20:00:00 through 24-10-2018 22:00:00.
df3:
date_time system_load
01-01-2019 00:00:00 309119
01-01-2019 01:00:00 309391
01-01-2019 02:00:00 309811
01-01-2019 03:00:00 309711
............... ...
31-12-2019 20:00:00 309611
31-12-2019 21:00:00 309411
31-12-2019 22:00:00 309111
31-12-2019 23:00:00 309911
The system load values in df3 are complete.
What I want is to interpolate the missing hourly records in df2 in a suitable way, using the corresponding hourly records of df1 and df3 (06-03-2017 20:00:00 through 24-10-2017 22:00:00 and 06-03-2019 20:00:00 through 24-10-2019 22:00:00, respectively). Based on "Pierre D"'s valuable comment I attached my scaled data.
Here is a very basic strategy that just takes data from neighboring years to fill the missing values. The offset is chosen to be precisely 52 weeks, so as to reflect possible weekly seasonality.
import pandas as pd

# get the whole series together, and resample to have missing data as NaN:
s = pd.concat([df1, df2, df3])['system_load'].resample('H').asfreq()
offset = 52 * 7 * 24  # 52 weeks, 7 days/week, 24 hours/day
filler = pd.concat([s.shift(offset), s.shift(-offset)], axis=1).mean(axis=1)
out = s.where(~s.isna(), filler)

# optional: make a new df2 with the filled values
df2mod = out.truncate(
    before='2018',
    after=pd.Timestamp('2019') - pd.Timedelta(1)
).to_frame('system_load')
Notes:
- out contains the "filled" series for the whole system_load, built from the neighboring years.
- we use pandas.DataFrame.mean() to build the filler series as the mean of the two neighboring years; since mean() skips NaN by default, if one of the two years has no value at a given hour, the filler is simply the other year's value.
- this is one of the most basic ways of filling the missing data, and it likely won't fool a careful observer. Depending on the intended usage of the reconstructed data, a more elaborate strategy should be considered. Data reconstruction is an active field of research, and there are sophisticated methods in the literature; for example, one could use a GAN to build a series that would be very hard to distinguish from real data.
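For reference, here is a minimal toy setup (entirely hypothetical values) on which the snippet above runs end to end:
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def year_frame(year, base):
    # one year of hourly timestamps with noisy load values around `base`
    idx = pd.date_range(f'{year}-01-01', f'{year}-12-31 23:00', freq='H')
    return pd.DataFrame({'system_load': base + rng.normal(0, 300, len(idx))},
                        index=idx)

df1 = year_frame(2017, 208000)
df2 = year_frame(2018, 209000)
df3 = year_frame(2019, 309000)
# knock out the gap described in the question (6 March to 24 October 2018):
df2.loc['2018-03-06 20:00':'2018-10-24 22:00', 'system_load'] = np.nan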

Using grouper to group a timestamp in a specific range

Suppose that I have a DataFrame (DF) whose index is a timestamp running from 11 AM to 6 PM every day, covering 30 days. I want to group it every 30 minutes. This is the function I'm using:
out = DF.groupby(pd.Grouper(freq='30min'))
The start date of the output is correct, but the grouping considers the whole day (24 h). For example, in the new timestamp I have something like this:
11:00:00
11:30:00
12:00:00
12:30:00
...
18:00:00
18:30:00
...
23:00:00
23:30:00
...
2:00:00
2:30:00
...
...
10:30:00
11:00:00
11:30:00
As a result, many outputs are empty because from 6:00 PM to 11 AM, I don't have any data.
One possible solution is DatetimeIndex.floor, which groups by the floored index values that actually occur, so empty bins never appear:
out = DF.groupby(DF.index.floor('30min'))
Or use dropna after the aggregation function:
out = DF.groupby(pd.Grouper(freq='30min')).mean().dropna()
As mentioned in a comment on the original post, this is expected behavior. If you want to remove the empty groups, simply filter them out afterwards. Assuming in this case you are using count to aggregate:
df = df.groupby(pd.Grouper(freq='30min')).count()
df = df[(df > 0).all(axis=1)]  # keep only the non-empty groups
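A small self-contained demonstration of the difference (hypothetical data; two days with values only between 11:00 and 18:00):
import pandas as pd

idx = pd.date_range('2023-01-01 11:00', periods=8, freq='H').append(
    pd.date_range('2023-01-02 11:00', periods=8, freq='H'))
DF = pd.DataFrame({'x': range(16)}, index=idx)

print(len(DF.groupby(pd.Grouper(freq='30min')).count()))  # 63 half-hour bins, mostly empty
print(len(DF.groupby(DF.index.floor('30min')).count()))   # 16 bins, all populated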

Reset the date portion to the last of the month while preserving the time

Is there any way to reset the date portion to the last day of the month while preserving the time? For example:
2018-01-02 23:00:00 -> 2018-01-31 23:00:00
2018-04-04 10:00:00 -> 2018-04-30 10:00:00
The Oracle function last_day() does exactly this. Try:
select last_day(sysdate), sysdate
from dual
to see how it works.
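To confirm that the time of day is preserved, here is a quick check with one of the question's example values (using to_date to build the input):
select last_day(to_date('2018-01-02 23:00:00', 'YYYY-MM-DD HH24:MI:SS'))
from dual;
-- 2018-01-31 23:00:00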
Ironically, I usually find the preservation of the time to be counterintuitive, so my usual usage is more like:
select last_day(trunc(sysdate))
from dual

SQL: select a timestamp and represent it as 9 hours back

Good day. I have a table that collects data with timestamps in UTC (Coordinated Universal Time). However, the location is 9 hours behind this time. I am writing a query that gets the timestamp and the value but "casts" the timestamp 9 hours back, since that is when it was recorded with respect to that location.
My issue is that I keep subtracting days, not hours, even though I specified hours in my DATEDIFF and DATEADD. How do I select a timestamp and the value but represent that timestamp as 9 hours back? Thanks for any help.
select DATEADD(hour, DATEDIFF(hour,9,TimeUTC),0) as DateActual, Value
From TableData
Data
2015-12-15 00:00:00 45
2015-12-15 00:00:00 54
Current results
2015-12-06 00:00:00 45
2015-12-06 00:00:00 54
Desired results
2015-12-14 15:00:00 45
2015-12-14 15:00:00 54
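For what it's worth, a sketch of the likely culprit and a direct fix (assuming SQL Server, as the DATEADD/DATEDIFF syntax suggests): in DATEDIFF(hour, 9, TimeUTC) the literal 9 is implicitly converted to a datetime nine days after the 1900-01-01 epoch, so the round trip through DATEADD(hour, ..., 0) ends up shifting the result by nine days rather than nine hours, which is exactly what the current results show. Subtracting the hours directly avoids the epoch arithmetic:
select DATEADD(hour, -9, TimeUTC) as DateActual, Value
from TableData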