Resample a datetimeIndex start day wrong - pandas

Source:
import pandas as pd
import numpy as np
cols = ['Date', 'Time', 'Load', 'Battery', 'Panel',
        'Wind', 'Temp', 'Humidity', 'Volt']
# parse_dates=[[0,1]] combines the Date and Time columns into a single Date_Time column
data = pd.read_csv('test.csv',delimiter=';',header=0,names=cols,
                   decimal=',',parse_dates=[[0,1]],
                   infer_datetime_format=True)
data.set_index('Date_Time',inplace=True)
I have this data frame:
In [126]: data.head()
Out[126]:
                     Load  Battery  Panel  Wind   Temp  Humidity  Volt
Date_Time
2018-07-31 13:07:15  13.3    326.3  353.1  0.98  33.93     21.92  3.89
2018-07-31 13:08:15  14.0    314.4  342.5  0.59  33.88     21.84  3.88
2018-07-31 13:09:16  13.4    309.6  335.5  0.39  33.84     22.14  3.88
2018-07-31 13:10:16  13.8    285.1  313.8  2.55  33.71     23.18  3.88
2018-07-31 13:11:16  13.6    292.9  314.7  2.03  33.62     23.25  3.88
......
with another 93000 rows, from 2018-07-31 to 2018-04-10. I'd like to resample by taking the sum of the values in each 10-minute window. So I tried:
In [127]: data.resample('10min',closed='left',label='left').sum()
Out[127]:
                      Load  Battery  Panel   Wind    Temp  Humidity   Volt
Date_Time
2018-01-08 00:00:00  136.9   -140.6   -2.9  19.06  291.27    245.63  39.45
2018-01-08 00:10:00  137.3   -140.7   -3.1  15.14  290.62    244.88  39.42
2018-01-08 00:20:00  137.4   -140.4   -2.3  18.03  288.61    246.44  39.44
2018-01-08 00:30:00  137.5   -140.4   -2.2  12.61  286.97    246.83  39.43
That is close to what I expect, but resample removes all the data from the first day (I suspect because the series does not start at midnight). What is the proper way to do the resampling? There are two issues:
The first day is missing from the result, i.e. all of its data is removed and the resampled dataframe starts on the first of August and not on 07/31.
It is OK to use intervals that start at midnight and so fall on exact multiples of 10 minutes (so 00:00, 10:00, 20:00 are fine), but then I expect the first group to be:
2018-07-31 13:07:15 13.3 326.3 353.1 0.98 33.93 21.92 3.89
2018-07-31 13:08:15 14.0 314.4 342.5 0.59 33.88 21.84 3.88
2018-07-31 13:09:16 13.4 309.6 335.5 0.39 33.84 22.14 3.88
and then the next group to start from 13:10:16, on the first day of the dataset, of course, and not on the second.
Ok. I solved it using:
x = data.loc['2018-07-31'].resample('10min').sum()
y = data.resample('10min',closed='left',label='left').sum()
r = pd.concat([x,y])
but I think that this must be a form of bug in resample.

For output that starts at exactly 2018-07-31 13:07:15, you need to pass the base argument, which the documentation describes as "the origin of the aggregated intervals".
Example code:
from datetime import timedelta
import numpy as np
import pandas as pd
start = pd.to_datetime('2018-07-31 13:07:15', format='%Y-%m-%d %H:%M:%S')
# one-minute timestamps; note that timedelta(10) means 10 days
minutes = pd.date_range(start, start + timedelta(10), freq='min')
df = pd.DataFrame({'Date_Time': minutes, 'Load': np.random.randint(13, size=len(minutes))})
df.set_index('Date_Time', inplace=True)
df.resample('10min', closed='left', label='left', base=7.25).sum()
Result:
Date_Time Load
2018-07-31 13:07:15 11
2018-07-31 13:17:15 1
2018-07-31 13:27:15 6
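Note that base was deprecated in pandas 1.1 and removed in 2.0, in favor of the origin and offset arguments. A minimal sketch of the same idea on a recent pandas, assuming made-up Load values (all ones, so each sum is simply the number of rows in the bin):

```python
import numpy as np
import pandas as pd

# Hypothetical data: 30 one-minute readings starting at 13:07:15.
start = pd.Timestamp("2018-07-31 13:07:15")
idx = pd.date_range(start, periods=30, freq="min")
df = pd.DataFrame({"Load": np.ones(len(idx))}, index=idx)

# origin="start" anchors the bins at the first timestamp of the index.
by_origin = df.resample("10min", origin="start").sum()

# offset shifts the default midnight-aligned bins by 7 minutes 15 seconds.
by_offset = df.resample("10min", offset=pd.Timedelta(minutes=7, seconds=15)).sum()

print(by_origin)
```

Both calls should give bins labelled 13:07:15, 13:17:15 and 13:27:15 here. origin="start" is the more convenient of the two when the first timestamp is not known in advance.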

Related

How to add new column in dataframe based on forecast cycle 00 and 12 UTC using pandas

I have a dataframe with data from two forecast cycles, one starting at 00 UTC and running out to a 168-hour forecast (the Forecast (valid time) column), and another starting at 12 UTC, also running out to a 168-hour forecast. Based on the dataframe below, I would like to create a column called Cycle that indicates, for each Date-Time row, which forecast cycle the data refers to. For example:
Date-Time Cycle
2020-07-16 00:00:00 00
2020-07-16 00:00:00 12
How can I do this?
My dataframe looks like this:
Array file
IIUC, you can use pandas.Series.where with pandas.Series.ffill :
import pandas as pd
df = pd.read_csv("df.csv", sep=";", index_col=0, usecols=[0,1,2])
df['Date-Time'] = pd.to_datetime(df['Date-Time'])
# is this row the start of a cycle?
m = df["Forecast (valid time)"].eq(0)
# keep the hour on cycle starts only, then carry it forward over the rest of the cycle
df["Cycle"] = df["Date-Time"].dt.hour.where(m).ffill()
Output:
print(df.groupby("Cycle").head(5))
             Date-Time  Forecast (valid time)  Cycle
0  2020-07-16 00:00:00                    0.0      0
1  2020-07-16 03:00:00                    3.0      0
2  2020-07-16 06:00:00                    6.0      0
3  2020-07-16 09:00:00                    9.0      0
4  2020-07-16 12:00:00                   12.0      0
57 2020-07-16 12:00:00                    0.0     12
58 2020-07-16 15:00:00                    3.0     12
59 2020-07-16 18:00:00                    6.0     12
60 2020-07-16 21:00:00                    9.0     12
61 2020-07-17 00:00:00                   12.0     12
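To try the where/ffill pattern without the CSV, here is a small self-contained sketch. The six rows are made up for illustration, and the final astype(int) is an optional extra that turns the float result back into integers:

```python
import pandas as pd

# Hypothetical miniature of the forecast table: two cycles, 3-hourly steps.
df = pd.DataFrame({
    "Date-Time": pd.to_datetime([
        "2020-07-16 00:00", "2020-07-16 03:00", "2020-07-16 06:00",
        "2020-07-16 12:00", "2020-07-16 15:00", "2020-07-16 18:00",
    ]),
    "Forecast (valid time)": [0.0, 3.0, 6.0, 0.0, 3.0, 6.0],
})

# True only on the first row of each cycle (forecast hour 0).
m = df["Forecast (valid time)"].eq(0)

# Keep the hour on cycle starts, NaN elsewhere, then carry it forward.
df["Cycle"] = df["Date-Time"].dt.hour.where(m).ffill().astype(int)
print(df["Cycle"].tolist())  # [0, 0, 0, 12, 12, 12]
```

This relies on the rows of each cycle being contiguous and starting with forecast hour 0, as in the dataframe shown in the question.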

How to match Datetimeindex for all but the year?

I have a dataset with missing values and a DatetimeIndex. I would like to fill these values with the mean of the other values reported at the same month, day and hour. If no values are reported at that specific month/day/hour in any year, I would like to use the interpolated mean value of the nearest reported hour. How can I achieve this? Right now my approach is this:
df_Na = df_Na[df_Na['Generation'].isna()]
df_raw = df_raw[~df_raw['Generation'].isna()]
# reduce to month
same_month = df_raw[df_raw.index.month.isin(df_Na.index.month)]
# reduce to same day
same_day = same_month[same_month.index.day.isin(df_Na.index.day)]
# reduce to hour
same_hour = same_day[same_day.index.hour.isin(df_Na.index.hour)]
df_Na contains all the missing values I would like to fill, and df_raw contains all the reported values from which I would like to compute the mean. I have a huge dataset, which is why I would like to avoid a for loop at all costs.
My Data looks like this:
df_Na
Generation
2017-12-02 19:00:00 NaN
2021-01-12 00:00:00 NaN
2021-01-12 01:00:00 NaN
..............................
2021-02-12 20:00:00 NaN
2021-02-12 21:00:00 NaN
2021-02-12 22:00:00 NaN
df_raw
Generation
2015-09-12 00:00:00 0.0
2015-09-12 01:00:00 19.0
2015-09-12 02:00:00 0.0
..............................
2021-12-11 21:00:00 0.0
2021-12-11 22:00:00 180.0
2021-12-11 23:00:00 0.0
Use GroupBy.transform with mean for averages per MM-DD HH and replace missing values by DataFrame.fillna:
df = df.fillna(df.groupby(df.index.strftime('%m-%d %H')).transform('mean'))
And then if necessary add DataFrame.interpolate:
df = df.interpolate(method='nearest')
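A small self-contained sketch of the fillna step, using made-up Generation values for the same calendar slots in three different years:

```python
import numpy as np
import pandas as pd

# Hypothetical hourly readings: the same month/day/hour in 2019, 2020 and 2021.
idx = pd.to_datetime([
    "2019-01-12 00:00", "2020-01-12 00:00", "2021-01-12 00:00",
    "2019-01-12 01:00", "2020-01-12 01:00", "2021-01-12 01:00",
])
df = pd.DataFrame({"Generation": [10.0, 20.0, np.nan, 4.0, np.nan, 8.0]}, index=idx)

# The group key ignores the year: "MM-DD HH".
key = df.index.strftime("%m-%d %H")

# transform('mean') returns a frame aligned with df, so fillna can use it directly.
filled = df.fillna(df.groupby(key).transform("mean"))
print(filled["Generation"].tolist())  # [10.0, 20.0, 15.0, 4.0, 6.0, 8.0]
```

Only the NaN cells are replaced; the reported values are left as they are. A slot that is missing in every year stays NaN after this step, which is what the interpolate call is for.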

upsampling timeseries from daily to hourly

I am using the data below, which is saved in a CSV file, and trying to convert it to hourly values using linear interpolation. However, I have not been successful.
Code:
import pandas as pd
df = pd.read_csv('d:/Python/resampling/FairyLake.csv')
df[ 'Date' ] = pd.to_datetime(df['Date'])
df.set_index('Date').resample('M').interpolate()
print(df)
Data
Date,Discharge
1/3/2008,0.05865
1/4/2008,0.105812
1/5/2008,0.191388
1/6/2008,0.315378
1/7/2008,0.477782
1/8/2008,0.6786
1/9/2008,0.917832
1/10/2008,0.783875701
1/11/2008,0.65678957
1/12/2008,0.545651187
1/13/2008,0.44222808
1/14/2008,0.353907613
1/15/2008,0.27414753
Results
Date Discharge
0 2008-01-03 0.058650
1 2008-01-04 0.105812
2 2008-01-05 0.191388
3 2008-01-06 0.315378
4 2008-01-07 0.477782
5 2008-01-08 0.678600
6 2008-01-09 0.917832
7 2008-01-10 0.783876
8 2008-01-11 0.656790
9 2008-01-12 0.545651
10 2008-01-13 0.442228
11 2008-01-14 0.353908
12 2008-01-15 0.274148
Two things:
The resample frequency should be hourly (H), not monthly (M).
The result needs to be assigned back (df = ...):
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date').resample('H').interpolate()
df:
Discharge
Date
2008-01-03 00:00:00 0.058650
2008-01-03 01:00:00 0.060615
2008-01-03 02:00:00 0.062580
2008-01-03 03:00:00 0.064545
2008-01-03 04:00:00 0.066510
... ...
2008-01-14 20:00:00 0.287441
2008-01-14 21:00:00 0.284118
2008-01-14 22:00:00 0.280794
2008-01-14 23:00:00 0.277471
2008-01-15 00:00:00 0.274148
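To see what each step contributes, here is a sketch on the first three rows of the data. The '60min' alias is used only as an assumption-free way to say "hourly" (newer pandas versions prefer lowercase 'h' over 'H'), and the asfreq step is added purely for illustration:

```python
import pandas as pd

# First three rows of the data from the question.
df = pd.DataFrame({
    "Date": ["1/3/2008", "1/4/2008", "1/5/2008"],
    "Discharge": [0.05865, 0.105812, 0.191388],
})
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")
ts = df.set_index("Date")

# Upsampling alone only creates the hourly rows; the new ones are NaN.
upsampled = ts.resample("60min").asfreq()

# interpolate() then fills the gaps linearly between the daily values.
hourly = ts.resample("60min").interpolate()

print(len(hourly), upsampled["Discharge"].isna().sum())  # 49 46
```

Neither call modifies ts or df in place, which is why the result has to be assigned to a name.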

Pandas resample only when makes sense

I have a time series that is very irregular. The time difference between two records can be anything from 1s to 10 days.
I want to resample the data every 1h, but only where sequential records are less than 1h apart.
How can I approach this without writing too many loops?
In the example below, I would like to resample only rows 5-6 (time difference of 10s) and rows 6-7 (time difference of 50min).
The others should remain as they are.
tmp=vals[['datumtijd','filter data']]
datumtijd filter data
0 1970-11-01 00:00:00 129.0
1 1970-12-01 00:00:00 143.0
2 1971-01-05 00:00:00 151.0
3 1971-02-01 00:00:00 151.0
4 1971-03-01 00:00:00 163.0
5 1971-03-01 00:00:10 163.0
6 1971-03-01 00:00:20 163.0
7 1971-03-01 00:01:10 163.0
8 1971-03-01 00:04:10 163.0
.. ... ...
244 1981-08-19 00:00:00 102.0
245 1981-09-02 00:00:00 98.0
246 1981-09-17 00:00:00 92.0
247 1981-10-01 00:00:00 89.0
248 1981-10-19 00:00:00 92.0
You can be a little more explicit about this by using groupby on the hour-floor of the timestamps:
grouped = df.groupby(df['datumtijd'].dt.floor('1H')).mean()
This is explicitly looking for the hour of each existing data point and grouping the matching ones.
But you can also just do the resample and then filter out the empty data, as pandas can still do this pretty quickly:
resampled = df.resample('1H', on='datumtijd').mean().dropna()
In either case, you get the following (note that I changed the last time stamp just so that the console would show the hours):
filter data
datumtijd
1970-11-01 00:00:00 129.0
1970-12-01 00:00:00 143.0
1971-01-05 00:00:00 151.0
1971-02-01 00:00:00 151.0
1971-03-01 00:00:00 163.0
1981-08-19 00:00:00 102.0
1981-09-02 00:00:00 98.0
1981-09-17 00:00:00 92.0
1981-10-01 00:00:00 89.0
1981-10-19 03:00:00 92.0
One quick clarification: in your example, rows 5-8 all occur within the same hour (the format is hour:minute:second), so they all get grouped together.
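A small sketch comparing the two approaches, using made-up readings and selecting the value column explicitly so that only it is averaged:

```python
import pandas as pd

# Hypothetical irregular series: three readings inside one hour, then a gap of days.
df = pd.DataFrame({
    "datumtijd": pd.to_datetime([
        "1971-03-01 00:00:00", "1971-03-01 00:00:10", "1971-03-01 00:50:00",
        "1971-03-05 12:30:00",
    ]),
    "filter data": [163.0, 165.0, 167.0, 100.0],
})

# Option 1: group on the hour each timestamp falls into.
grouped = df.groupby(df["datumtijd"].dt.floor("60min"))["filter data"].mean()

# Option 2: resample hourly, then drop the hours that had no readings.
resampled = df.resample("60min", on="datumtijd")["filter data"].mean().dropna()

print(grouped.tolist())    # [165.0, 100.0]
print(resampled.tolist())  # [165.0, 100.0]
```

The groupby version never materializes the empty hours, which may matter when the gaps between records are very long.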

Creating values from datetime objects in certain fixed divisions

I am trying to create a new column in which, for example, the time 14:02 is stored as 14.0, whereas 14:16 is stored as 14.5. This corresponds to half-hour units. Other resolutions, such as 15-minute units, should of course also be possible. This is my approach for full hours, but I need a higher resolution.
df["Time"] = df.StartDateTime.apply(lambda x: x.hour)
So long as the units evenly divide an hour you can round with that frequency and then divide by an hour.
import pandas as pd
df = pd.DataFrame({'Time': pd.timedelta_range('14:00:00', freq='4min', periods=10)})
for freq in ['30min', '15min', '20min', '10min']:
    # round to the nearest multiple of freq, then express the result in hours
    df[freq] = df['Time'].dt.round(freq)/pd.Timedelta('1H')
      Time  30min  15min      20min      10min
0 14:00:00   14.0  14.00  14.000000  14.000000
1 14:04:00   14.0  14.00  14.000000  14.000000
2 14:08:00   14.0  14.25  14.000000  14.166667
3 14:12:00   14.0  14.25  14.333333  14.166667
4 14:16:00   14.5  14.25  14.333333  14.333333
5 14:20:00   14.5  14.25  14.333333  14.333333
6 14:24:00   14.5  14.50  14.333333  14.333333
7 14:28:00   14.5  14.50  14.666667  14.500000
8 14:32:00   14.5  14.50  14.666667  14.500000
9 14:36:00   14.5  14.50  14.666667  14.666667
If you start from a datetime64[ns] column you can isolate the time by subtracting off the normalized date. For example:
df = pd.DataFrame({'Time': pd.date_range('2010-01-01 14:00:00', freq='4min', periods=5)})
df['Time_only'] = df['Time'] - df['Time'].dt.normalize()
# Time Time_only
#0 2010-01-01 14:00:00 14:00:00
#1 2010-01-01 14:04:00 14:04:00
#2 2010-01-01 14:08:00 14:08:00
#3 2010-01-01 14:12:00 14:12:00
#4 2010-01-01 14:16:00 14:16:00
print(df.dtypes)
#Time datetime64[ns]
#Time_only timedelta64[ns]
#dtype: object
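Putting the two steps together for a datetime column, here is a sketch that assumes a StartDateTime column like the one in the question (the three timestamps are made up):

```python
import pandas as pd

# Hypothetical StartDateTime column, named as in the question.
df = pd.DataFrame({"StartDateTime": pd.to_datetime([
    "2010-01-01 14:02", "2010-01-01 14:16", "2010-01-01 14:44",
])})

# Time of day as a timedelta: subtract midnight of the same date.
time_only = df["StartDateTime"] - df["StartDateTime"].dt.normalize()

# Round to the nearest half hour, then express the result in hours.
df["Time"] = time_only.dt.round("30min") / pd.Timedelta(hours=1)
print(df["Time"].tolist())  # [14.0, 14.5, 14.5]
```

Because this rounds to the nearest unit, a time such as 23:50 would come out as 24.0; dt.floor could be used instead of dt.round if times should always be assigned to the unit they fall in.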