How to match Datetimeindex for all but the year? - pandas

I have a dataset with missing values and a DatetimeIndex. I would like to fill these values with the mean of the other values reported at the same month, day and hour. If no value is reported for that month/day/hour in any year, I would like to use the interpolated value from the nearest reported hour instead. How can I achieve this? Right now my approach is this:
df_Na = df_Na[df_Na['Generation'].isna()]
df_raw = df_raw[~df_raw['Generation'].isna()]
# reduce to month
same_month = df_raw[df_raw.index.month.isin(df_Na.index.month)]
# reduce to same day
same_day = same_month[same_month.index.day.isin(df_Na.index.day)]
# reduce to hour
same_hour = same_day[same_day.index.hour.isin(df_Na.index.hour)]
df_Na holds all the missing values I would like to fill, and df_raw holds all the reported values from which I would like to compute the mean. I have a huge dataset, which is why I would like to avoid a for loop at all costs.
My data looks like this:
df_Na
Generation
2017-12-02 19:00:00 NaN
2021-01-12 00:00:00 NaN
2021-01-12 01:00:00 NaN
..............................
2021-02-12 20:00:00 NaN
2021-02-12 21:00:00 NaN
2021-02-12 22:00:00 NaN
df_raw
Generation
2015-09-12 00:00:00 0.0
2015-09-12 01:00:00 19.0
2015-09-12 02:00:00 0.0
..............................
2021-12-11 21:00:00 0.0
2021-12-11 22:00:00 180.0
2021-12-11 23:00:00 0.0

Use GroupBy.transform with mean for averages per MM-DD HH and replace missing values by DataFrame.fillna:
df = df.fillna(df.groupby(df.index.strftime('%m-%d %H')).transform('mean'))
And then, if necessary, add DataFrame.interpolate to fill slots that no year reports (method='nearest' requires SciPy):
df = df.interpolate(method='nearest')
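A minimal sketch of the first step on made-up data, to show how the MM-DD HH key lines up the same slot across years. The six timestamps and their values are invented for illustration, and the names key, slot_mean and filled are not from the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two hourly slots (00:00 and 01:00 on 12 January) over three years.
idx = pd.to_datetime([
    '2019-01-12 00:00', '2019-01-12 01:00',
    '2020-01-12 00:00', '2020-01-12 01:00',
    '2021-01-12 00:00', '2021-01-12 01:00',
])
df = pd.DataFrame({'Generation': [10.0, 40.0, 30.0, np.nan, np.nan, 60.0]}, index=idx)

# Same month, day and hour share one key, whatever the year.
key = df.index.strftime('%m-%d %H')

# Per-slot mean, broadcast back to the original shape (NaN is skipped by mean).
slot_mean = df.groupby(key).transform('mean')

# Only the missing entries are replaced; reported values stay untouched.
filled = df.fillna(slot_mean)
print(filled)
```

Here the 00:00 slot averages 10 and 30, and the 01:00 slot averages 40 and 60, so the three gaps become 50, 20 and 20 is not repeated: each gap takes the mean of its own slot only.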

Related

Minimum value of a resample (not 0)

I have a dataframe (df) indexed by dates (freq: 15 minutes). A small example:
datetime Value
2019-09-02 16:15:00 0.00
2019-09-02 16:30:00 3.07
2019-09-02 16:45:00 1.05
I want to resample my dataframe to a frequency of 1 month and calculate the minimum value within each month, like this:
df_min = df.resample('1M').min()
Up to this point all is good, but I need the minimum to ignore zeros, so I want something like min(i > 0) and I don't know how to get it.
Here is one way to do it.
Assumption: datetime is the index.
import numpy as np
# turn the zeros into NaN, then take the monthly min (min skips NaN)
df_min = df.replace(0, np.nan).resample('1M').min()
Value
datetime
2019-09-30 1.05
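A possible variant, sketched on the three sample rows from the question: masking with where(df > 0) instead of replace(0, np.nan) also ignores negative values, should the data contain any. Grouping by the monthly period is used here in place of resample only to avoid depending on a particular month alias; that choice is an assumption of this sketch, not part of the answer above.

```python
import pandas as pd

# The three sample rows from the question.
idx = pd.to_datetime(['2019-09-02 16:15', '2019-09-02 16:30', '2019-09-02 16:45'])
df = pd.DataFrame({'Value': [0.00, 3.07, 1.05]}, index=idx)

# Keep strictly positive values; everything else becomes NaN and is skipped by min().
positive = df.where(df > 0)

# One row per calendar month, labelled by period (2019-09) rather than by month end.
df_min = positive.groupby(df.index.to_period('M')).min()
print(df_min)
```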

How to find consecutive zeros in time series

I have a data frame whose index is an hourly datetime and whose single column is counts. It looks like the following table:
date counts
2017-03-31 00:00:00+00:00 0.0
2017-03-31 01:00:00+00:00 0.0
2017-03-31 02:00:00+00:00 0.0
2017-03-31 03:00:00+00:00 0.0
2017-03-31 04:00:00+00:00 0.0
... ...
2022-06-19 19:00:00+00:00 6.0
2022-06-19 20:00:00+00:00 6.0
2022-06-19 21:00:00+00:00 1.0
2022-06-19 22:00:00+00:00 1.0
2022-06-19 23:00:00+00:00 1.0
If there are 15 hours' worth of zero counts in a row, they are considered an error and I want to flag them. The data frame is not complete and there are missing dates (gaps) in the data.
I tried resampling the data frame to 15 hours and looking for resampled 15-hour bins whose sum is zero, but that didn't give me the correct answer.
If counts is guaranteed to be non-negative, you can use rolling and check for the max value:
df["is_error"] = df["counts"].rolling(15).max() == 0
If counts can be negative, you have to check both min and max:
r = df["counts"].rolling(15)
df["is_error"] = r.min().eq(0) & r.max().eq(0)
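Assuming the dates are sorted, group each non-zero row together with the run of zeros that follows it and get the group size. Because the size includes that leading non-zero row, a size greater than 15 means at least 15 consecutive zeros; flag those rows True: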
m = df['counts'].ne(0)
c = df.groupby(m.cumsum())['counts'].transform('size')
df['error'] = c.gt(15).mask(m, False)
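Since the question mentions gaps in the timestamps, here is a hedged sketch of a variant that counts only the zero rows of each run and starts a new run whenever two consecutive rows are not exactly one hour apart. Treating a gap as the end of a run is an assumption about what "in a row" should mean, and flag_zero_runs, min_len and step are hypothetical names introduced for this example.

```python
import pandas as pd

def flag_zero_runs(counts: pd.Series, min_len: int = 15,
                   step: pd.Timedelta = pd.Timedelta(hours=1)) -> pd.Series:
    """Flag every row that belongs to a run of at least min_len consecutive zeros.

    Assumes counts has a sorted DatetimeIndex. A run is broken by a non-zero
    value or by a jump in the index that differs from step.
    """
    is_zero = counts.eq(0)
    # True where the zero/non-zero state differs from the previous row (and on the first row).
    state_change = is_zero.astype(int).diff().ne(0).to_numpy()
    # True where the previous timestamp is not exactly one step earlier (and on the first row).
    time_gap = counts.index.to_series().diff().ne(step).to_numpy()
    # Each change or gap opens a new run.
    run_id = (state_change | time_gap).cumsum()
    run_len = is_zero.groupby(run_id).transform('size')
    return is_zero & run_len.ge(min_len)

# Small demonstration on made-up data: 1, fifteen zeros, 2, three zeros.
idx = pd.date_range('2022-01-01', periods=20, freq='h')
counts = pd.Series([1.0] + [0.0] * 15 + [2.0] + [0.0] * 3, index=idx)
flags = flag_zero_runs(counts)
print(flags.sum())
```

Unlike the rolling approach, this marks every row of a qualifying run rather than only the rows where a full window has already elapsed.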

Interpolate hourly load of a selected months of a year from the same months of the previous year and the next year in python pandas?

I have the following three dataframes:
df1:
date_time system_load
01-01-2017 00:00:00 208111
01-01-2017 01:00:00 208311
01-01-2017 02:00:00 208311
01-01-2017 03:00:00 208011
............... ...
31-12-2017 20:00:00 208611
31-12-2017 21:00:00 208411
31-12-2017 22:00:00 208111
31-12-2017 23:00:00 208911
The system load values of df1 have no problems.
df2:
date_time system_load
01-01-2018 00:00:00 208111
01-01-2018 01:00:00 208311
01-01-2018 02:00:00 208311
01-01-2018 03:00:00 208011
............... ...
31-12-2018 20:00:00 209611
31-12-2018 21:00:00 209411
31-12-2018 22:00:00 209111
31-12-2018 23:00:00 209911
The system load values of df2 are missing from 06-03-2018 20:00:00 to 24-10-2018 22:00:00.
df3:
date_time system_load
01-01-2019 00:00:00 309119
01-01-2019 01:00:00 309391
01-01-2019 02:00:00 309811
01-01-2019 03:00:00 309711
............... ...
31-12-2019 20:00:00 309611
31-12-2019 21:00:00 309411
31-12-2019 22:00:00 309111
31-12-2019 23:00:00 309911
The system load values of df3 have no problems.
What I want is to interpolate, in a suitable way, the missing hourly records in df2 using the corresponding hourly records of df1 and df3 (06-03-2017 20:00:00 to 24-10-2017 22:00:00 and 06-03-2019 20:00:00 to 24-10-2019 22:00:00 respectively). Based on "Pierre D"'s valuable comment I attached my scaled data.
Here is a very basic strategy that just takes data from neighboring years to fill the missing values. The offset is chosen to be precisely 52 weeks, so as to reflect possible weekly seasonality.
# get the whole series together, and resample to have missing data as NaN:
s = pd.concat([df1, df2, df3])['system_load'].resample('H').asfreq()
offset = 52 * 7 * 24 # 52 weeks, 7 days/week, 24 hours/day
filler = pd.concat([s.shift(offset), s.shift(-offset)], axis=1).mean(axis=1)
out = s.where(~s.isna(), filler)
# optional: make a new df2 with the filled values
df2mod = out.truncate(
    before='2018',
    after=pd.Timestamp('2019') - pd.Timedelta(1)
).to_frame('system_load')
Notes:
out contains the "filled" series for the whole system_load using neighboring years.
we use pandas.DataFrame.mean() to build the filler series as the mean of the two neighboring years, in a way that takes care of NaN (e.g. if one year or the other has NaN, then the mean is the only non-NaN value).
this is one of the most basic ways of filling the missing data, and likely won't fool a careful observer. Depending on the intended usage of the reconstructed data, a more elaborate strategy should be considered. Data reconstruction is an active field of research, and there are sophisticated methods in the literature. For example, one could use a GAN to build a resulting series that would be very hard to discriminate from real data.
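The snippet above presumes the frames are indexed by a DatetimeIndex, whereas the question shows date_time as day-first text. The sketch below, on synthetic data, parses that text explicitly before applying the same 52-week fill. The linear load values, the raw frame and the truth series are invented for this example so that the result can be checked; they are not the asker's data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df1 + df2 + df3: three years of hourly, linearly rising load.
idx = pd.date_range('2017-01-01 00:00', '2019-12-31 23:00', freq='h')
truth = pd.Series(np.arange(len(idx), dtype=float), index=idx)

# Mimic the question's layout: day-first text timestamps, rows absent inside the gap.
raw = pd.DataFrame({
    'date_time': idx.strftime('%d-%m-%Y %H:%M:%S'),
    'system_load': truth.to_numpy(),
})
in_gap = (idx >= '2018-03-06 20:00') & (idx <= '2018-10-24 22:00')
raw = raw[~in_gap]

# Parse with an explicit day-first format, then restore the hourly grid (gap becomes NaN).
raw['date_time'] = pd.to_datetime(raw['date_time'], format='%d-%m-%Y %H:%M:%S')
s = raw.set_index('date_time')['system_load'].asfreq('h')

# Same strategy as above: average the values 52 weeks earlier and 52 weeks later.
offset = 52 * 7 * 24
filler = pd.concat([s.shift(offset), s.shift(-offset)], axis=1).mean(axis=1)
out = s.where(s.notna(), filler)
```

With a perfectly linear series the two neighbours average back to the true value, which makes this a convenient sanity check; real load data will of course not be reconstructed exactly.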

pandas groupby several criteria

I have a dataframe that contains a row for every minute of a year.
I need to reduce it to an hourly basis: one row per hour of the year, holding the maximum of the Reserved and Used columns for that hour.
I made this, which works, but not totally for my purposes
df = df.assign(date=df.date.dt.round('H'))
df1 = df.groupby('date').agg({'Reserved': ['max'], 'Used': ['max'] }).droplevel(1, axis=1).reset_index()
which just groups the minutes into hours.
date Reserved Used
0 2020-01-01 00:00:00 2176 0.0
1 2020-01-01 01:00:00 2176 0.0
2 2020-01-01 02:00:00 2176 0.0
3 2020-01-01 03:00:00 2176 0.0
4 2020-01-01 04:00:00 2176 0.0
... ... ... ...
8780 2020-12-31 20:00:00 3450 50.0
8781 2020-12-31 21:00:00 3450 0.0
8782 2020-12-31 22:00:00 3450 0.0
8783 2020-12-31 23:00:00 3450 0.0
8784 2021-01-01 00:00:00 3450 0.0
Now I need to group it further to plot several curves, each containing only 24 points (one per hour of the day), based on several criteria:
average Used and Reserved for the whole year (grouping together every 00 hour, every 01 hour, etc.)
average Used and Reserved for every month (grouping every 00 hour, 01 hour, etc. for each month individually)
average Used and Reserved for weekdays and for weekends
I know this is just a similar groupby to the one before, but I somehow miss the logic of doing it.
Could anybody help?
Thanks.
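One possible way to build the three sets of curves listed in the question is to group by keys derived from the date column. This is only a sketch on synthetic data: the random Reserved and Used values and the names hour, month, is_weekend and by_* are assumptions made for illustration, and df1 is taken to be the hourly frame produced by the question's code.

```python
import numpy as np
import pandas as pd

# Synthetic hourly frame standing in for df1 from the question (values are made up).
rng = np.random.default_rng(0)
date = pd.date_range('2020-01-01 00:00', '2020-12-31 23:00', freq='h')
df1 = pd.DataFrame({
    'date': date,
    'Reserved': rng.integers(2000, 3500, len(date)),
    'Used': rng.uniform(0, 50, len(date)),
})

# Grouping keys derived from the timestamp.
hour = df1['date'].dt.hour.rename('hour')
month = df1['date'].dt.month.rename('month')
is_weekend = (df1['date'].dt.dayofweek >= 5).rename('is_weekend')  # Saturday=5, Sunday=6

cols = ['Reserved', 'Used']
by_hour = df1.groupby(hour)[cols].mean()                         # whole year: 24 rows
by_month_hour = df1.groupby([month, hour])[cols].mean()          # 12 months x 24 hours
by_daytype_hour = df1.groupby([is_weekend, hour])[cols].mean()   # weekday/weekend x 24 hours

# Wide layout for plotting: 24 rows (hours), one column per month.
used_per_month = by_month_hour['Used'].unstack('month')
```

Each column of used_per_month is then one 24-point curve, so used_per_month.plot() would draw the twelve monthly curves in one figure if matplotlib is available.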

Pandas: How to fill missing period/datetime values in a multiindex time series?

I have a multiindex dataframe where one of the indexes is a Period or DateTime. It has some missing values like the one below:
dt = pd.DataFrame(zip(['x']*4 + ['y']*4,
                      range(8),
                      list(pd.period_range('2020-08-02T00:00:00', '2020-08-02T03:00:00', freq='H'))*2),
                  columns=['a', 'b', 'd']).set_index(['a', 'd'])
dt = dt.drop([('x', pd.Period('2020-08-02 01:00', 'H')),
              ('y', pd.Period('2020-08-02 01:00', 'H'))])
dt
I'd like to fill the missing period values with NaN. The end result would be:
If I had a time series with a simple index, it would be easy: dt.resample('H').first(). How should I do it in this multi-index timeseries?
According to your comment under Henry Yik, I assume that all time series are within the same range, so I guess you can use reindex and create the MultiIndex.from_product like:
dt_ = dt.reindex(pd.MultiIndex.from_product(
    [dt.index.get_level_values('a').unique(),
     pd.date_range(dt.index.get_level_values('d').min(),
                   dt.index.get_level_values('d').max(),
                   freq='H')],
    names=dt.index.names))
print(dt_)
b
a d
x 2020-08-02 00:00:00 0.0
2020-08-02 01:00:00 NaN
2020-08-02 02:00:00 2.0
2020-08-02 03:00:00 3.0
y 2020-08-02 00:00:00 4.0
2020-08-02 01:00:00 NaN
2020-08-02 02:00:00 6.0
2020-08-02 03:00:00 7.0
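If the 'd' level holds Period values, as in the question's construction, the same reindex idea can be sketched with pd.period_range so that the new labels have the same type as the existing ones. This is an illustrative variant under that assumption; the names periods, d, full and out are introduced only for this example.

```python
import pandas as pd

# Rebuild the question's frame with a Period level and two missing hours.
periods = pd.period_range('2020-08-02 00:00', '2020-08-02 03:00', freq='h')
dt = pd.DataFrame({
    'a': ['x'] * 4 + ['y'] * 4,
    'b': range(8),
    'd': list(periods) * 2,
}).set_index(['a', 'd'])
dt = dt.drop([('x', periods[1]), ('y', periods[1])])

# Full grid: every 'a' value crossed with every hourly period between min and max.
d = dt.index.get_level_values('d')
full = pd.MultiIndex.from_product(
    [dt.index.get_level_values('a').unique(),
     pd.period_range(d.min(), d.max())],  # frequency is taken from the Period endpoints
    names=dt.index.names)

# Rows absent from dt appear with NaN in 'b'.
out = dt.reindex(full)
print(out)
```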
I think you could simply reset the index for a groupby:
dt = dt.reset_index("a").groupby("a").resample('H').first()
dt["a"] = dt["a"].ffill()
print (dt)
a b
a d
x 2020-08-02 00:00 x 0.0
2020-08-02 01:00 x NaN
2020-08-02 02:00 x 2.0
2020-08-02 03:00 x 3.0
y 2020-08-02 00:00 y 4.0
2020-08-02 01:00 y NaN
2020-08-02 02:00 y 6.0
2020-08-02 03:00 y 7.0