Pandas add column based on grouped by rolling average

I have successfully added a new summed Volume column using Transform when grouping by Date like so:
df
Name Date Volume
--------------------------
APL 12-01-2017 1102
BSC 12-01-2017 4500
CDF 12-02-2017 5455
df['vol_all_daily'] = df['Volume'].groupby([df['Date']]).transform('sum')
Name Date Volume vol_all_daily
------------------------------------------
APL 12-01-2017 1102 5602
BSC 12-01-2017 4500 5602
CDF 12-02-2017 5455 5455
However, when I want to take the rolling average it doesn't work!
df['vol_all_ma_2'] = df['vol_all_daily'].groupby([df['Date']]).rolling(window=2).mean()
This returns a grouped result that gives an error and is too hard to put back into a df column anyway.
df['vol_all_ma_2'] = df['vol_all_daily'].groupby([df['Date']]).transform('mean').rolling(window=2).mean()
This just produces a result nearly identical to the vol_all_daily column.
Update:
I wasn't taking just one row per date; the above code still takes multiple rows per date. Instead, I added .first() to the groupby. I'm not sure why groupby isn't taking one row per date.

The behavior of what you have written seems correct (Part 1 below), but perhaps you want to be calling something different (Part 2 below).
Part 1: Why what you have written is behaving correctly:
import pandas as pd

d = {'Name': ['APL', 'BSC', 'CDF'],
     'Date': pd.DatetimeIndex(['2017-12-01', '2017-12-01', '2017-12-02']),
     'Volume': [1102, 4500, 5455]}
df = pd.DataFrame(d)
df['vol_all_daily'] = df['Volume'].groupby([df['Date']]).transform('sum')
print(df)
rolling_vol = df['vol_all_daily'].groupby([df['Date']]).rolling(window=2).mean()
print('')
print(rolling_vol)
I get as output:
Date Name Volume vol_all_daily
0 2017-12-01 APL 1102 5602
1 2017-12-01 BSC 4500 5602
2 2017-12-02 CDF 5455 5455
Date
2017-12-01 0 NaN
1 5602.0
2017-12-02 2 NaN
Name: vol_all_daily, dtype: float64
To understand why this result rolling_vol is correct, notice that you first called groupby and only after that called rolling, so the rolling window is applied separately within each date group. The result is indexed by group and position, so it should not be expected to fit back into df.
Part 2: What I think you wanted to call (just a rolling average):
If you instead run:
# same as above but without groupby
rolling_vol2 = df['vol_all_daily'].rolling(window=2).mean()
print('')
print(rolling_vol2)
You should get:
0 NaN
1 5602.0
2 5528.5
Name: vol_all_daily, dtype: float64
which looks more like the rolling average you seem to want. To explain that, I suggest reading the details of pandas resampling vs rolling.
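If what you ultimately want is a rolling average over the daily totals (one value per date) mapped back onto every row, here is a minimal sketch continuing from the df built in Part 1 above; the column name vol_all_ma_2 is just illustrative:
# Take the daily total once per date (one row per date, as in the question's update)
daily = df.groupby('Date')['vol_all_daily'].first()

# Rolling mean over the unique dates (window of 2 here)
daily_ma = daily.rolling(window=2).mean()

# Map the per-date rolling mean back onto every row of the original frame
df['vol_all_ma_2'] = df['Date'].map(daily_ma)
print(df)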

Related

Pandas add row to datetime indexed dataframe

I cannot find a solution for this problem. I would like to add future dates to a datetime indexed Pandas dataframe for model prediction purposes.
Here is where I am right now:
new_datetime = df2.index[-1:] # current end of datetime index
increment = '1 days' # string for increment - eventually will be in a for loop to add add'l days
new_datetime = new_datetime+pd.Timedelta(increment)
And this is where I am stuck. The append examples online always seem to show ignore_index=True, and in my case I want to keep proper datetime indexing.
Suppose you have this df:
date value
0 2020-01-31 00:00:00 1
1 2020-02-01 00:00:00 2
2 2020-02-02 00:00:00 3
then an alternative for adding future days is
df.append(pd.DataFrame({'date': pd.date_range(start=df.date.iloc[-1], periods=6, freq='D', closed='right')}))
which returns
date value
0 2020-01-31 00:00:00 1.0
1 2020-02-01 00:00:00 2.0
2 2020-02-02 00:00:00 3.0
0 2020-02-03 00:00:00 NaN
1 2020-02-04 00:00:00 NaN
2 2020-02-05 00:00:00 NaN
3 2020-02-06 00:00:00 NaN
4 2020-02-07 00:00:00 NaN
where the frequency is D (days) and the number of periods is 6.
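Note that DataFrame.append was removed in pandas 2.0 (and date_range's closed= argument was replaced by inclusive=). A minimal sketch of the same idea with pd.concat, assuming the small df above and simply starting the new range one day after the last date instead of using closed='right':
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2020-01-31', '2020-02-01', '2020-02-02']),
                   'value': [1, 2, 3]})

# Five future dates starting the day after the current last date
future = pd.DataFrame({'date': pd.date_range(start=df['date'].iloc[-1] + pd.Timedelta('1 days'),
                                             periods=5, freq='D')})

# pd.concat replaces the removed DataFrame.append; ignore_index avoids duplicate 0..n labels
df = pd.concat([df, future], ignore_index=True)
print(df)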
I think I was making this more difficult than necessary because I was using a datetime index instead of the typical integer index. By leaving the 'date' field as a regular column instead of an index, adding the rows is straightforward.
One thing I did do was add a reset_index call so I did not end up with wonky duplicate index values:
df = df.append(pd.DataFrame({'date': pd.date_range(start=df.date.iloc[-1], periods=21, freq='D', closed='right')}))
df = df.reset_index() # resets index
I also needed this, and I solved it by merging the code you shared with the code from this other response ("add to a dataframe as I go with datetime index"), ending up with the following code that works for me.
data=raw.copy()
new_datetime = data.index[-1:] # current end of datetime index
increment = '1 days' # string for increment - eventually will be in a for loop to add add'l days
new_datetime = new_datetime+pd.Timedelta(increment)
today_df = pd.DataFrame({'value': 301.124},index=new_datetime)
data = data.append(today_df)
data.tail()
Here 'value' is the column header of your own dataframe.

Datetime reformat weekly column

I split a dataframe from minute data into daily, weekly, and monthly frames. I had no problem reformatting the daily dataframe, but I am having a good bit of trouble doing the same with the weekly one. Here is the weekly dataframe; if someone could help me out, that would be great. I am adding the code I used to reformat the daily dataframe, in case it helps!
I am plotting it with Bokeh, and without the datetime format I won't be able to format the axis and hover tools as I would like.
Thanks beforehand.
dfDay1 = dfDay.loc['2014-01-01':'2020-09-30']
dfDay1 = dfDay1.reset_index()
dfDay1['date1'] = pd.to_datetime(dfDay1['date'], format=('%Y/%m/%d'))
dfDay1 = dfDay1.set_index('date')
That worked fine for the day format.
If you need to convert the dates before the /, use Series.str.split with str[0]; if you need the dates after the /, use str[1]:
df['date1'] = pd.to_datetime(df['week'].str.split('/').str[0])
print (df)
week Open Low High Close Volume \
0 2014-01-07/2014-01-13 58.1500 55.38 58.96 56.0000 324133239
1 2014-01-14/2014-01-20 56.3500 55.96 58.57 56.2500 141255151
2 2014-01-21/2014-01-27 57.8786 51.85 59.31 52.8600 279370121
3 2014-01-28/2014-02-03 53.7700 52.75 63.95 62.4900 447186604
4 2014-02-04/2014-02-10 62.8900 60.45 64.90 63.9100 238316161
.. ... ... ... ... ... ...
347 2020-09-01/2020-09-07 297.4000 271.14 303.90 281.5962 98978386
348 2020-09-08/2020-09-14 275.0000 262.64 281.40 271.0100 109717114
349 2020-09-15/2020-09-21 272.6300 244.13 274.52 248.5800 123816172
350 2020-09-22/2020-09-28 254.3900 245.40 259.98 255.8800 98550687
351 2020-09-29/2020-10-05 258.2530 256.50 268.33 261.3500 81921670
date1
0 2014-01-07
1 2014-01-14
2 2014-01-21
3 2014-01-28
4 2014-02-04
.. ...
347 2020-09-01
348 2020-09-08
349 2020-09-15
350 2020-09-22
351 2020-09-29
[352 rows x 7 columns]
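A minimal self-contained sketch of the same split, with a made-up two-row 'week' column; str[1] gives the end-of-week date instead of the start:
import pandas as pd

df = pd.DataFrame({'week': ['2014-01-07/2014-01-13', '2014-01-14/2014-01-20']})

# Start-of-week date: the part before the '/'
df['date1'] = pd.to_datetime(df['week'].str.split('/').str[0])

# End-of-week date: the part after the '/'
df['date2'] = pd.to_datetime(df['week'].str.split('/').str[1])
print(df)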

Time column interval filter

I have a dataframe with a "Fecha" column. I would like to reduce the dataframe's size by filtering it, keeping just the rows whose time falls on a multiple of 10 minutes and discarding all the rows that do not.
Any ideas?
Thanks
I have to guess some variable names. But assuming your dataframe name is df, the solution should look similar to:
df['Fecha'] = pd.to_datetime(df['Fecha'])
df = df[df['Fecha'].dt.minute % 10 == 0]
The first line guarantees that your 'Fecha' column is in datetime format. The second line keeps all rows whose minute is a multiple of 10. To do this you use the modulus operator %.
Since I'm not sure if this solves your problem, here's a minimal example that runs by itself:
import pandas as pd
idx = pd.date_range(pd.Timestamp(2020, 1, 1), periods=60, freq='1T')
series = pd.Series(1, index=idx)
series = series[series.index.minute % 10 == 0]
series
The first three lines construct a series with a 1 minute index, which is filtered in the fourth line.
Output:
2020-01-01 00:00:00 1
2020-01-01 00:10:00 1
2020-01-01 00:20:00 1
2020-01-01 00:30:00 1
2020-01-01 00:40:00 1
2020-01-01 00:50:00 1
dtype: int64
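And a comparable sketch with the timestamps in a 'Fecha' column rather than in the index (the 'valor' column is made up), using the .dt accessor:
import pandas as pd

df = pd.DataFrame({'Fecha': pd.date_range('2020-01-01', periods=60, freq='1min'),
                   'valor': range(60)})

# Keep only the rows whose minute is a multiple of 10
df = df[df['Fecha'].dt.minute % 10 == 0]
print(df)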

How to use groupby for multiple columns in pandas for the image shown below?

This is the input table in pandas (posted as an image):
This is the desired output table, as shown in the second image:
Dear friends,
I am new to pandas. How do I get the result shown in the second image using pandas?
I am getting the output shown below using this approach:
df.groupby(['Months', 'Status']).size()
Months Status
Apr-20 IW 2
OW 1
Jun-20 IW 4
OW 4
May-20 IW 3
OW 2
dtype: int64
But how do I convert this output into the form shown in the second image?
It would be very helpful if someone is able to help me. Thanks in advance.
Use crosstab with the margins=True parameter, then if necessary remove the last Total column, reorder the columns with DataFrame.reindex using the ordering of the original column, and finally convert the index to a column with DataFrame.reset_index and remove the columns name with DataFrame.rename_axis:
df = (pd.crosstab(df['Status'], df['Months'], margins_name='Total', margins=True)
        .iloc[:, :-1]
        .reindex(df['Months'].unique(), axis=1)
        .reset_index()
        .rename_axis(None, axis=1))
print (df)
Status Apr_20 May_20 Jun_20
0 IW 4 2 4
1 OW 1 2 4
2 Total 5 4 8
Unstack, and then transpose:
df = df.groupby(['Months', 'Status']).size().unstack().T
To get a total row:
df.sum().rename('Total').to_frame().T.append(df)
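Since DataFrame.append was removed in pandas 2.0, here is a minimal sketch of the same total-row idea with pd.concat (the counts below are made up to stand in for the unstacked table):
import pandas as pd

# Stand-in for the unstacked/transposed count table
df = pd.DataFrame({'Apr-20': [2, 1], 'May-20': [3, 2], 'Jun-20': [4, 4]},
                  index=['IW', 'OW'])

# Column sums as a one-row 'Total' frame, placed on top with pd.concat
total = df.sum().rename('Total').to_frame().T
df = pd.concat([total, df])
print(df)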

Pandas Resample Strange Zero Tolerance Behavior

I'm attempting to resample a time series in Pandas and I am getting some odd behavior:
print samples[196].base_df.to_string()
Units Sales
2008-07-03 3 820.00
2008-07-04 3 470.00
...
2010-06-22 1 335.00
2010-06-24 2 180.00
2010-06-30 -1 -2502.00
print samples[196].base_df.resample('15d', how='sum')
Units Sales
2008-07-03 17 3.149130e+03
2008-07-18 29 6.305210e+03
...
2010-06-08 18 5.204000e+03
2010-06-23 1 -2.322000e+03
2010-07-08 0 6.521324e-312
I would have expected the last value in the resampled series to be either zero or omitted. Is this expected behavior for the resample function? If helpful I can post the full time series, but it is a bit long...
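For reference, the how= keyword was later removed from resample; a minimal sketch of the equivalent call in current pandas, with a tiny made-up frame (an empty bin sums to 0 with the default min_count=0):
import pandas as pd

df = pd.DataFrame({'Units': [3, 3, 1], 'Sales': [820.0, 470.0, 335.0]},
                  index=pd.to_datetime(['2008-07-03', '2008-07-04', '2008-08-10']))

# Modern equivalent of resample('15d', how='sum'); the empty 2008-07-18 bin comes back as 0
print(df.resample('15D').sum())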