Pandas resample only when it makes sense

I have a time series that is very irregular. The difference in time between two records can be 1s or 10 days.
I want to resample the data every 1h, but only when the sequential records are less than 1h apart.
How should I approach this without making too many loops?
In the example below, I would like to resample only rows 5-6 (delta difference is 10s) and rows 6-7 (delta difference is 50min).
The others should remain as they are.
tmp=vals[['datumtijd','filter data']]
datumtijd filter data
0 1970-11-01 00:00:00 129.0
1 1970-12-01 00:00:00 143.0
2 1971-01-05 00:00:00 151.0
3 1971-02-01 00:00:00 151.0
4 1971-03-01 00:00:00 163.0
5 1971-03-01 00:00:10 163.0
6 1971-03-01 00:00:20 163.0
7 1971-03-01 00:01:10 163.0
8 1971-03-01 00:04:10 163.0
.. ... ...
244 1981-08-19 00:00:00 102.0
245 1981-09-02 00:00:00 98.0
246 1981-09-17 00:00:00 92.0
247 1981-10-01 00:00:00 89.0
248 1981-10-19 00:00:00 92.0

You can be a little explicit about this by using groupby on the hour-floor of the time stamps:
grouped = df.groupby(df['datumtijd'].dt.floor('1H')).mean()
This is explicitly looking for the hour of each existing data point and grouping the matching ones.
But you can also just do the resample and then filter out the empty data, as pandas can still do this pretty quickly:
resampled = df.resample('1H', on='datumtijd').mean().dropna()
In either case, you get the following (note that I changed the last time stamp just so that the console would show the hours):
filter data
datumtijd
1970-11-01 00:00:00 129.0
1970-12-01 00:00:00 143.0
1971-01-05 00:00:00 151.0
1971-02-01 00:00:00 151.0
1971-03-01 00:00:00 163.0
1981-08-19 00:00:00 102.0
1981-09-02 00:00:00 98.0
1981-09-17 00:00:00 92.0
1981-10-01 00:00:00 89.0
1981-10-19 03:00:00 92.0
One quick clarification: in your example, rows 5-8 all occur within the same hour (they differ only in minutes and seconds), so they all get grouped together.
Also, see this related post.
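For completeness, here is a minimal, self-contained sketch of the groupby-floor approach (column names taken from the question; the sample values below are made up):
import pandas as pd

# hypothetical data mimicking the question's layout
df = pd.DataFrame({
    'datumtijd': pd.to_datetime(['1971-03-01 00:00:00', '1971-03-01 00:00:10',
                                 '1971-03-01 00:00:20', '1971-03-01 00:01:10',
                                 '1971-03-02 00:00:00']),
    'filter data': [163.0, 163.0, 163.0, 163.0, 170.0],
})

# records that share the same hour are averaged; isolated records pass through unchanged
hourly = df.groupby(df['datumtijd'].dt.floor('1H'))['filter data'].mean()
print(hourly)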

Related

Get the value in a dataframe based on a value and a date in another dataframe

I tried countless answers to similar problems here on SO but couldn't find anything that works for this scenario. It's driving me nuts.
I have these two Dataframes:
df_op:
index  Date                 Close   Name  LogRet
0      2022-11-29 00:00:00  240.33  MSFT  -0.0059
1      2022-11-29 00:00:00  280.57  QQQ   -0.0076
2      2022-12-13 00:00:00  342.46  ADBE   0.0126
3      2022-12-13 00:00:00  256.92  MSFT   0.0173
df_quotes:
index  Date                 Close   Name
72     2022-11-29 00:00:00  141.17  AAPL
196    2022-11-29 00:00:00  240.33  MSFT
73     2022-11-30 00:00:00  148.03  AAPL
197    2022-11-30 00:00:00  255.14  MSFT
11     2022-11-30 00:00:00  293.36  QQQ
136    2022-12-01 00:00:00  344.11  ADBE
198    2022-12-01 00:00:00  254.69  MSFT
12     2022-12-02 00:00:00  293.72  QQQ
I would like to add a column to df_op that indicates the close of the stock in df_quotes 2 days later. For example, the first row of df_op should become:
index  Date                 Close   Name  LogRet   Next
0      2022-11-29 00:00:00  240.33  MSFT  -0.0059  254.69
In other words:
for each row in df_op, find the row in df_quotes with the same Name and a Date 2 days later, and copy its Close into df_op in a new column 'Next'.
I tried tens of combinations like this without success:
df_quotes[df_quotes['Date'].isin(df_op['Date'] + pd.DateOffset(days=2)) & df_quotes['Name'].isin(df_op['Name'])]
How can I do this without resorting to loops?
Try this:
# first convert to datetime
df_op['Date'] = pd.to_datetime(df_op['Date'])
df_quotes['Date'] = pd.to_datetime(df_quotes['Date'])

# merge on Date and Name, but the quote date is offset by 2 business days
(pd.merge(df_op,
          df_quotes[['Date', 'Close', 'Name']].rename({'Close': 'Next'}, axis=1),
          left_on=['Date', 'Name'],
          right_on=[df_quotes['Date'] - pd.tseries.offsets.BDay(2), 'Name'],
          how='left')
   .drop(['Date_x', 'Date_y'], axis=1))
Output:
Date index Close Name LogRet Next
0 2022-11-29 0 240.33 MSFT -0.0059 254.69
1 2022-11-29 1 280.57 QQQ -0.0076 NaN
2 2022-12-13 2 342.46 ADBE 0.0126 NaN
3 2022-12-13 3 256.92 MSFT 0.0173 NaN
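To reproduce this locally, here is a minimal sketch that rebuilds the two frames from the question (only the columns the merge uses; the explicit index column is left out) before running the snippet above:
import pandas as pd

df_op = pd.DataFrame({
    'Date': ['2022-11-29', '2022-11-29', '2022-12-13', '2022-12-13'],
    'Close': [240.33, 280.57, 342.46, 256.92],
    'Name': ['MSFT', 'QQQ', 'ADBE', 'MSFT'],
    'LogRet': [-0.0059, -0.0076, 0.0126, 0.0173],
})

df_quotes = pd.DataFrame({
    'Date': ['2022-11-29', '2022-11-29', '2022-11-30', '2022-11-30',
             '2022-11-30', '2022-12-01', '2022-12-01', '2022-12-02'],
    'Close': [141.17, 240.33, 148.03, 255.14, 293.36, 344.11, 254.69, 293.72],
    'Name': ['AAPL', 'MSFT', 'AAPL', 'MSFT', 'QQQ', 'ADBE', 'MSFT', 'QQQ'],
})
The pd.to_datetime conversion and the merge from the answer can then be applied to these frames as written.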

Interpolate hourly load of a selected months of a year from the same months of the previous year and the next year in python pandas?

I have the following three dataframes:
df1:
date_time system_load
01-01-2017 00:00:00 208111
01-01-2017 01:00:00 208311
01-01-2017 02:00:00 208311
01-01-2017 03:00:00 208011
............... ...
31-12-2017 20:00:00 208611
31-12-2017 21:00:00 208411
31-12-2017 22:00:00 208111
31-12-2017 23:00:00 208911
The system load values of df1 are complete.
df2:
date_time system_load
01-01-2018 00:00:00 208111
01-01-2018 01:00:00 208311
01-01-2018 02:00:00 208311
01-01-2018 03:00:00 208011
............... ...
31-12-2018 20:00:00 209611
31-12-2018 21:00:00 209411
31-12-2018 22:00:00 209111
31-12-2018 23:00:00 209911
The system load values of df2 are missing from 06-03-2018 20:00:00 up to 24-10-2018 22:00:00.
df3:
date_time system_load
01-01-2019 00:00:00 309119
01-01-2019 01:00:00 309391
01-01-2019 02:00:00 309811
01-01-2019 03:00:00 309711
............... ...
31-12-2019 20:00:00 309611
31-12-2019 21:00:00 309411
31-12-2019 22:00:00 309111
31-12-2019 23:00:00 309911
The system load values of df3 are complete.
What I want is to interpolate, in a suitable way, the missing hourly records in df2 using the corresponding hourly records of df1 and df3 (06-03-2017 20:00:00 up to 24-10-2017 22:00:00 and 06-03-2019 20:00:00 up to 24-10-2019 22:00:00, respectively). Based on "Pierre D"'s valuable comment I attached my scaled data.
Here is a very basic strategy that just takes data from neighboring years to fill the missing values. The offset is chosen to be precisely 52 weeks, so as to reflect possible weekly seasonality.
# get the whole series together, and resample to have missing data as NaN:
s = pd.concat([df1, df2, df3])['system_load'].resample('H').asfreq()
offset = 52 * 7 * 24 # 52 weeks, 7 days/week, 24 hours/day
filler = pd.concat([s.shift(offset), s.shift(-offset)], axis=1).mean(axis=1)
out = s.where(~s.isna(), filler)
# optional: make a new df2 with the filled values
df2mod = out.truncate(
    before='2018',
    after=pd.Timestamp('2019') - pd.Timedelta(1)
).to_frame('system_load')
Notes:
out contains the "filled" series for the whole system_load using neighboring years.
we use pandas.DataFrame.mean() to build the filler series as the mean of the two neighboring years; since mean() skips NaN by default, if one of the two years is NaN the filler simply takes the other year's value.
this is one of the most basic ways of filling the missing data, and likely won't fool a careful observer. Depending on the intended usage of the reconstructed data, a more elaborate strategy should be considered. Data reconstruction is an active field of research, and there are sophisticated methods in the literature. For example, one could use a GAN to build a resulting series that would be very hard to discriminate from real data.
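One assumption baked into the snippet above is that date_time is already parsed and set as a DatetimeIndex; otherwise resample('H') will raise. A minimal preparation sketch, using the column name from the question (dayfirst because the dates are written as DD-MM-YYYY):
import pandas as pd

# assumed preparation step before the concat/resample above
for df in (df1, df2, df3):
    df['date_time'] = pd.to_datetime(df['date_time'], dayfirst=True)
    df.set_index('date_time', inplace=True)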

pandas groupby several criteria

I have a dataframe that contains a row for every minute of a year.
I need to simplify it to an hourly basis: keep only the hours of the year, with the maximum of the Reserved and Used columns for each hour.
I came up with this, which works, but not entirely for my purposes:
df = df.assign(date=df.date.dt.round('H'))
df1 = df.groupby('date').agg({'Reserved': ['max'], 'Used': ['max'] }).droplevel(1, axis=1).reset_index()
which just groups the minutes into hours.
date Reserved Used
0 2020-01-01 00:00:00 2176 0.0
1 2020-01-01 01:00:00 2176 0.0
2 2020-01-01 02:00:00 2176 0.0
3 2020-01-01 03:00:00 2176 0.0
4 2020-01-01 04:00:00 2176 0.0
... ... ... ...
8780 2020-12-31 20:00:00 3450 50.0
8781 2020-12-31 21:00:00 3450 0.0
8782 2020-12-31 22:00:00 3450 0.0
8783 2020-12-31 23:00:00 3450 0.0
8784 2021-01-01 00:00:00 3450 0.0
Now I need to group it further to plot several curves, each containing only 24 points (one per hour of the day), based on several criteria:
average used and reserved for the whole year (so to group together every 00 hour, every 01 hour, etc.)
average used and reserved for every month (so to group every 00 hour, 01 hour etc for each month individually)
average used and reserved for weekdays and for weekends
I know this is just a similar groupby to the one above, but I can't quite work out the logic of doing it.
Could anybody help?
Thanks.
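A sketch of the groupby keys these three criteria seem to call for, assuming the hourly frame df1 produced by the snippet above (untested against the real data):
# 1) hour-of-day average over the whole year (24 rows)
by_hour = df1.groupby(df1['date'].dt.hour)[['Reserved', 'Used']].mean()

# 2) hour-of-day average per month (up to 12 x 24 rows)
by_month_hour = df1.groupby([df1['date'].dt.month,
                             df1['date'].dt.hour])[['Reserved', 'Used']].mean()

# 3) hour-of-day average for weekdays vs. weekends
is_weekend = df1['date'].dt.dayofweek >= 5
by_daytype_hour = df1.groupby([is_weekend,
                               df1['date'].dt.hour])[['Reserved', 'Used']].mean()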

How to fill pandas dataframe with max() values

I have a dataframe where each day starts at 7:00 and ends at 22:10, in 5-minute intervals.
The df contains around 200 days (weekend days and some specific days are excluded).
Date Time Volume
0 2019-09-03 07:00:00 70000 778
1 2019-09-03 07:05:00 70500 1267
2 2019-09-03 07:10:00 71000 1208
3 2019-09-03 07:15:00 71500 715
4 2019-09-03 07:20:00 72000 372
I need another column, let's call it 'lastdayVolume', with the maximum Volume of the prior day.
For example, if on 2019-09-03 (between 7:00 and 22:10) the maximum Volume in a single row is 50000, then every row of 2019-09-04 should have the value 50000 in column 'lastdayVolume'.
How would you do this without decreasing the length of the dataframe?
Have you tried
df.resample('1D', on='Date').max()
This should give you one row per day with the maximum value for that day.
EDIT: To combine that with the old data, you can use a left join. It's a bit messy, but
pd.merge(df,
         df.resample('1D', on='Date')['Volume'].max().rename('lastdayVolume'),
         left_on=pd.to_datetime((df['Date'] - pd.Timedelta('1d')).dt.date),
         right_index=True,
         how='left')
which gives:
Date Time Volume lastdayVolume
0 2019-09-03 07:00:00 70000 778 800.0
1 2019-09-03 07:05:00 70500 1267 800.0
2 2019-09-03 07:10:00 71000 1208 800.0
3 2019-09-03 07:15:00 71500 715 800.0
4 2019-09-03 07:20:00 72000 372 800.0
0 2019-09-02 08:00:00 70000 800 NaN
seems to work out.
Equivalently you can use the slightly shorter
df.join(df.resample('1D', on='Date')['Volume'].max().rename('lastdayVolume'),
        on=pd.to_datetime((df['Date'] - pd.Timedelta('1d')).dt.date))
here.
The first DataFrame is your old one; the second is the one calculated above (with appropriate renaming). For the values to merge on, the left side uses your 'Date' column, which contains timestamps, offset by one day and converted to an actual date; the right side simply uses the index.
The left join ensures you don't accidentally drop rows if you have no transactions the day before.
EDIT 2: To compute that maximum over a certain time range only, you can use
df.set_index('Date').between_time('15:30:00', '22:10:00')
to filter the DataFrame first. Afterwards, resample as before:
df.join(df.set_index('Date').between_time('15:30:00', '22:10:00').resample('1D')...
where the on parameter in the resample is no longer necessary as the Date went into the index.
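Spelled out in full, the combined chain might look like this (time window and column names taken from the snippets above; an unverified sketch):
# maximum volume per day, counting only the 15:30-22:10 window
evening_max = (df.set_index('Date')
                 .between_time('15:30:00', '22:10:00')
                 .resample('1D')['Volume'].max()
                 .rename('lastdayVolume'))

# attach yesterday's window maximum to every row of the original frame
out = df.join(evening_max,
              on=pd.to_datetime((df['Date'] - pd.Timedelta('1d')).dt.date))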

Pandas take daily mean within resampled date

I have a dataframe with trip counts every 20 minutes during a whole month, let's say:
Date Trip count
0 2019-08-01 00:00:00 3
1 2019-08-01 00:20:00 2
2 2019-08-01 00:40:00 4
3 2019-08-02 00:00:00 6
4 2019-08-02 00:20:00 4
5 2019-08-02 00:40:00 2
I want the mean trip count for each 20-minute time of day, averaged over all days. The desired output (for the values above) looks like:
Date mean
0 00:00:00 4.5
1 00:20:00 3
2 00:40:00 3
..
72 23:40:00 ..
You can aggregate by the time-of-day values created by Series.dt.time, because the timestamps only ever have 00, 20 or 40 minutes and no seconds:
df['Date'] = pd.to_datetime(df['Date'])
df1 = df.groupby(df['Date'].dt.time).mean()
#alternative
#df1 = df.groupby(df['Date'].dt.strftime('%H:%M:%S')).mean()
print (df1)
Trip count
Date
00:00:00 4.5
00:20:00 3.0
00:40:00 3.0