Pandas: How to fill missing period/datetime values in a multiindex time series?

I have a multiindex dataframe where one of the indexes is a Period or DateTime. It has some missing values like the one below:
import pandas as pd

dt = pd.DataFrame(zip(['x']*4 + ['y']*4,
                      range(8),
                      list(pd.period_range('2020-08-02T00:00:00',
                                           '2020-08-02T03:00:00', freq='H'))*2),
                  columns=['a', 'b', 'd']).set_index(['a', 'd'])
dt = dt.drop([('x', pd.Period('2020-08-02 01:00', 'H')),
              ('y', pd.Period('2020-08-02 01:00', 'H'))])
dt
I'd like to fill the missing period values with NaN, so the end result has a row for every hour, with NaN where data is missing.
If I had a time series with a simple index, it would be easy: dt.resample('H').first(). How should I do it with this multi-index time series?

Based on your comment under Henry Yik's answer, I assume that all the time series span the same range, so you can use reindex and build the new index with MultiIndex.from_product like:
dt_ = dt.reindex(pd.MultiIndex.from_product(
    [dt.index.get_level_values('a').unique(),
     pd.date_range(dt.index.get_level_values('d').min(),
                   dt.index.get_level_values('d').max(),
                   freq='H')],
    names=dt.index.names))
print(dt_)
                           b
a d
x 2020-08-02 00:00:00    0.0
  2020-08-02 01:00:00    NaN
  2020-08-02 02:00:00    2.0
  2020-08-02 03:00:00    3.0
y 2020-08-02 00:00:00    4.0
  2020-08-02 01:00:00    NaN
  2020-08-02 02:00:00    6.0
  2020-08-02 03:00:00    7.0
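Note that the d level of dt holds Period values, so depending on your pandas version the Timestamp labels produced by pd.date_range may not line up with them. If that happens, the same idea works by building that level with pd.period_range instead (a small variant, not guaranteed to be needed in your setup):

# same reindex, but keep the 'd' level as Periods so it matches the original index
dt_ = dt.reindex(pd.MultiIndex.from_product(
    [dt.index.get_level_values('a').unique(),
     pd.period_range(dt.index.get_level_values('d').min(),
                     dt.index.get_level_values('d').max(),
                     freq='H')],
    names=dt.index.names))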

I think you could simply reset part of the index and use a groupby with resample:
dt = dt.reset_index("a").groupby("a").resample('H').first()
dt["a"] = dt["a"].ffill()
print (dt)
                      a    b
a d
x 2020-08-02 00:00    x  0.0
  2020-08-02 01:00    x  NaN
  2020-08-02 02:00    x  2.0
  2020-08-02 03:00    x  3.0
y 2020-08-02 00:00    y  4.0
  2020-08-02 01:00    y  NaN
  2020-08-02 02:00    y  6.0
  2020-08-02 03:00    y  7.0
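The extra a column only mirrors the index level, so once it has been forward-filled you can drop it if you don't need it:

dt = dt.drop(columns="a")   # 'a' is still available as an index level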

Related

How to add new column in dataframe based on forecast cycle 00 and 12 UTC using pandas

I have a dataframe with data from two forecast cycles: one starting at 00 UTC and going up to a 168-hour forecast (the Forecast (valid time) column), and another starting at 12 UTC and also going up to a 168-hour forecast. Based on the dataframe below, I would like to create a column called Cycle that indicates, for each Date-Time row, which forecast cycle the data refers to. For example:
Date-Time Cycle
2020-07-16 00:00:00 00
2020-07-16 00:00:00 12
How can I do this?
My dataframe is in the attached array file.
IIUC, you can use pandas.Series.where with pandas.Series.ffill:
import pandas as pd

df = pd.read_csv("df.csv", sep=";", index_col=0, usecols=[0, 1, 2])

df['Date-Time'] = pd.to_datetime(df['Date-Time'])

# is this row the start of a cycle (forecast hour 0)?
m = df["Forecast (valid time)"].eq(0)
df["Cycle"] = df["Date-Time"].dt.hour.where(m).ffill()
Output:
print(df.groupby("Cycle").head(5))
Date-Time Forecast (valid time) Cycle
0 2020-07-16 00:00:00 0.0 0
1 2020-07-16 03:00:00 3.0 0
2 2020-07-16 06:00:00 6.0 0
3 2020-07-16 09:00:00 9.0 0
4 2020-07-16 12:00:00 12.0 0
57 2020-07-16 12:00:00 0.0 12
58 2020-07-16 15:00:00 3.0 12
59 2020-07-16 18:00:00 6.0 12
60 2020-07-16 21:00:00 9.0 12
61 2020-07-17 00:00:00 12.0 12
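If Cycle should literally read 00 and 12 as in the desired output, one way (assuming no NaN is left in Cycle after the ffill) is to cast the filled hours to zero-padded strings:

df["Cycle"] = df["Cycle"].astype(int).astype(str).str.zfill(2)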

How to match Datetimeindex for all but the year?

I have a dataset with missing values and a DatetimeIndex. I would like to fill these values with the mean of the values reported at the same month, day and hour. If no values were reported at that specific month/day/hour in any year, I would like to use the interpolated mean of the nearest reported hour. How can I achieve this? Right now my approach is this:
df_Na = df_Na[df_Na['Generation'].isna()]
df_raw = df_raw[~df_raw['Generation'].isna()]
# reduce to month
same_month = df_raw[df_raw.index.month.isin(df_Na.index.month)]
# reduce to same day
same_day = same_month[same_month.index.day.isin(df_Na.index.day)]
# reduce to hour
same_hour = same_day[same_day.index.hour.isin(df_Na.index.hour)]
df_Na holds all the missing values I would like to fill, and df_raw holds all the reported values from which I would like to take the mean. I have a huge dataset, which is why I want to avoid a for loop at all costs.
My Data looks like this:
df_Na
Generation
2017-12-02 19:00:00 NaN
2021-01-12 00:00:00 NaN
2021-01-12 01:00:00 NaN
..............................
2021-02-12 20:00:00 NaN
2021-02-12 21:00:00 NaN
2021-02-12 22:00:00 NaN
df_raw
Generation
2015-09-12 00:00:00 0.0
2015-09-12 01:00:00 19.0
2015-09-12 02:00:00 0.0
..............................
2021-12-11 21:00:00 0.0
2021-12-11 22:00:00 180.0
2021-12-11 23:00:00 0.0
Use GroupBy.transform with mean to get the average per MM-DD HH key, and replace the missing values with DataFrame.fillna:
df = df.fillna(df.groupby(df.index.strftime('%m-%d %H')).transform('mean'))
And then, if necessary, add DataFrame.interpolate:
df = df.interpolate(method='nearest')
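A minimal self-contained sketch of the idea, with made-up numbers just for illustration (your real df would be df_raw and df_Na combined into one frame):

import pandas as pd
import numpy as np

idx = pd.to_datetime(['2020-09-12 01:00', '2021-09-12 01:00', '2022-09-12 01:00'])
df = pd.DataFrame({'Generation': [10.0, np.nan, 20.0]}, index=idx)

# average per month-day-hour key across years, then fill the gaps with it
df = df.fillna(df.groupby(df.index.strftime('%m-%d %H')).transform('mean'))
# optional: nearest-neighbour interpolation for keys with no reported value at all
# (method='nearest' requires scipy to be installed)
df = df.interpolate(method='nearest')
print(df)   # the NaN at 2021-09-12 01:00 becomes 15.0, the mean of 10.0 and 20.0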

Pandas: fill not all NaN in 2 concatenated data frames with different timestamps

I have two data frames, one with more frequent entries. I would like to concat them and fill each NaN in the less frequent column with that column's last entry, but if the last entry was itself NaN, the filled value should stay NaN.
Example:
import numpy as np
import pandas as pd

df = pd.DataFrame(data=[4.5, 4.6, 5.7, 5.7, 6.7, 4, 9.0],
                  index=list(map(pd.to_datetime, ['00:00', '00:30', '01:00', '01:30',
                                                  '02:00', '02:30', '03:00'])),
                  columns=['frequent data'])
df2 = pd.DataFrame(data=[4.5, np.nan, 5.7, np.nan],
                   index=list(map(pd.to_datetime, ['00:00', '01:00', '02:00', '03:00'])),
                   columns=['data'])
df2
frequent data data
2022-01-15 00:00:00 4.5 4.5
2022-01-15 01:00:00 5.7 NaN
2022-01-15 02:00:00 6.7 5.7
2022-01-15 03:00:00 9.0 NaN
new_df = pd.concat((df, df2), axis=1)
new_df
frequent data data
2022-01-15 00:00:00 4.5 4.5
2022-01-15 00:30:00 4.6 NaN
2022-01-15 01:00:00 5.7 NaN
2022-01-15 01:30:00 5.7 NaN
2022-01-15 02:00:00 6.7 5.7
2022-01-15 02:30:00 4.0 NaN
2022-01-15 03:00:00 9.0 NaN
I would like to end up with a data frame like this:
frequent data data
2022-01-15 00:00:00 4.5 4.5
2022-01-15 00:30:00 4.6 4.5
2022-01-15 01:00:00 5.7 NaN
2022-01-15 01:30:00 5.7 NaN
2022-01-15 02:00:00 6.7 5.7
2022-01-15 02:30:00 4.0 5.7
2022-01-15 03:00:00 9.0 NaN
Is there an easy way to do this, or do I need to write my own function?
IIUC:
df2 = df2.reindex(df.index).groupby(lambda x: x.floor('H')).ffill()
new_df = pd.concat([df, df2], axis=1)
print(new_df)
# Output
frequent data data
2022-01-15 00:00:00 4.5 4.5
2022-01-15 00:30:00 4.6 4.5
2022-01-15 01:00:00 5.7 NaN
2022-01-15 01:30:00 5.7 NaN
2022-01-15 02:00:00 6.7 5.7
2022-01-15 02:30:00 4.0 5.7
2022-01-15 03:00:00 9.0 NaN
You can also do the ffill after the concat:
new_df = pd.concat([df, df2], axis=1).groupby(lambda x: x.floor('H')).ffill()

multi index (time series) slicing error in pandas

I have the dataframe below, where date and time are the two levels of a MultiIndex.
When I run this code:
idx = pd.IndexSlice
print(df_per_wday_temp.loc[idx[:, datetime.time(4, 0, 0):datetime.time(7, 0, 0)]])
I get the error 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (2), lexsort depth (1)'. This may be an error in the index slicing, but I don't know why it happens. Can anybody solve it?
a b
date time
2018-01-26 19:00:00 25.08 -7.85
19:15:00 24.86 -7.81
19:30:00 24.67 -8.24
19:45:00 NaN -9.32
20:00:00 NaN -8.29
20:15:00 NaN -8.58
20:30:00 NaN -9.48
20:45:00 NaN -8.73
21:00:00 NaN -8.60
21:15:00 NaN -8.70
21:30:00 NaN -8.53
21:45:00 NaN -8.90
22:00:00 NaN -8.55
22:15:00 NaN -8.48
22:30:00 NaN -9.90
22:45:00 NaN -9.70
23:00:00 NaN -8.98
23:15:00 NaN -9.17
23:30:00 NaN -9.07
23:45:00 NaN -9.45
00:00:00 NaN -9.64
00:15:00 NaN -10.08
00:30:00 NaN -8.87
00:45:00 NaN -9.91
01:00:00 NaN -9.91
01:15:00 NaN -9.93
01:30:00 NaN -9.55
01:45:00 NaN -9.51
02:00:00 NaN -9.75
02:15:00 NaN -9.44
... ... ...
03:45:00 NaN -9.28
04:00:00 NaN -9.96
04:15:00 NaN -10.19
04:30:00 NaN -10.20
04:45:00 NaN -9.85
05:00:00 NaN -10.33
05:15:00 NaN -10.18
05:30:00 NaN -10.81
05:45:00 NaN -10.51
06:00:00 NaN -10.41
06:15:00 NaN -10.49
06:30:00 NaN -10.13
06:45:00 NaN -10.36
07:00:00 NaN -10.71
07:15:00 NaN -12.11
07:30:00 NaN -10.76
07:45:00 NaN -10.76
08:00:00 NaN -11.63
08:15:00 NaN -11.18
08:30:00 NaN -10.49
08:45:00 NaN -11.18
09:00:00 NaN -10.67
09:15:00 NaN -10.60
09:30:00 NaN -10.36
09:45:00 NaN -9.39
10:00:00 NaN -9.77
10:15:00 NaN -9.54
10:30:00 NaN -8.99
10:45:00 NaN -9.01
11:00:00 NaN -10.01
Thanks in advance.
If sorting the index is not possible, it is necessary to create a boolean mask and filter by boolean indexing:
from datetime import time
mask = df1.index.get_level_values(1).to_series().between(time(4, 0, 0), time(7, 0, 0)).values
df = df1[mask]
print (df)
a b
date time
2018-01-26 04:00:00 NaN -9.96
04:15:00 NaN -10.19
04:30:00 NaN -10.20
04:45:00 NaN -9.85
05:00:00 NaN -10.33
05:15:00 NaN -10.18
05:30:00 NaN -10.81
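If sorting the index is acceptable (note that it reorders the rows by time within each date), lexsorting it first also removes the error and the original IndexSlice code then works as written. A sketch under that assumption:

import datetime
idx = pd.IndexSlice
df_sorted = df_per_wday_temp.sort_index()
print(df_sorted.loc[idx[:, datetime.time(4, 0, 0):datetime.time(7, 0, 0)], :])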

Pandas SettingWithCopyWarning when using .loc

I'm trying to change the values in a column of a dataframe based on a condition.
In [1]:df.head()
Out[2]: gen cont
timestamp
2012-07-01 00:00:00 0.293 0
2012-07-01 00:30:00 0.315 0
2012-07-01 01:00:00 0.0 0
2012-07-01 01:30:00 0.005 0
2012-07-01 02:00:00 0.231 0
I want to set the 'gen' column to NaN whenever the sum of the 2 columns is below a threshold of 0.01, so what I want is this:
In [1]:df.head()
Out[2]: gen cont
timestamp
2012-07-01 00:00:00 0.293 0
2012-07-01 00:30:00 0.315 0
2012-07-01 01:00:00 NaN 0
2012-07-01 01:30:00 NaN 0
2012-07-01 02:00:00 0.231 0
I have used this:
df.loc[df.gen + df.cont < 0.01, 'gen'] = np.nan
It gives me the result I want but with the warning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
I am confused because I am using .loc and I think I'm using it in the way suggested.
For me your solution works fine.
An alternative solution is Series.mask, which by default sets NaN where the condition is True:
df['gen'] = df['gen'].mask(df['gen'] + df['cont'] < 0.01)
print (df)
timestamp gen cont
0 2012-07-01 00:00:00 0.293 0
1 2012-07-01 00:30:00 0.315 0
2 2012-07-01 01:00:00 NaN 0
3 2012-07-01 01:30:00 NaN 0
4 2012-07-01 02:00:00 0.231 0
EDIT:
You need a copy.
If df was created by slicing another DataFrame and you modify values in df later, the modifications will not propagate back to the original data (df_in), and pandas warns about that. Make the copy explicit:
df = df_in.loc[sDate:eDate].copy()
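A minimal sketch of that situation, using df_in, sDate and eDate as placeholders for the original frame and slice bounds mentioned above (the numbers are taken from the question's sample data):

import pandas as pd
import numpy as np

df_in = pd.DataFrame({'gen': [0.293, 0.315, 0.0, 0.005, 0.231], 'cont': 0},
                     index=pd.date_range('2012-07-01', periods=5, freq='30min'))

sDate, eDate = '2012-07-01 00:00', '2012-07-01 02:00'
df = df_in.loc[sDate:eDate].copy()          # explicit copy: no SettingWithCopyWarning later
df.loc[df['gen'] + df['cont'] < 0.01, 'gen'] = np.nan
print(df)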