Groupby month parameter in Multi-level Index in pandas

I have a large DataFrame which is structured like this. It has multiple stocks in level 0, and Date is level 1. The monthly data starts at 12/31/2004 and continues to 12/31/2017 (not shown).
Date DAILY_RETURN
A 12/31/2004 NaN
1/31/2005 -8.26
2/28/2005 8.55
3/31/2005 -7.5
4/29/2005 -6.53
5/31/2005 15.71
6/30/2005 -4.12
7/29/2005 13.99
8/31/2005 22.56
9/30/2005 1.83
10/31/2005 -2.26
11/30/2005 11.4
12/30/2005 -6.65
1/31/2006 1.86
2/28/2006 6.16
3/31/2006 4.31
What I want to do is group by the month and then count the number of POSITIVE returns in DAILY_RETURN by month (i.e. 01, then 02, 03, etc. from the Date part of the index). This code will give me the count, but only by index level=0.
df3.groupby(level=0)['DAILY_RETURN'].agg(['count'])
There are other questions out there, this one being the closest, but I cannot get the code to work. Can someone help out? Ultimately, what I want to do is group by stock and then month and FILTER all stocks that have at least 70% positive returns by month. I can't seem to figure out how to get the positive returns from the dataframe either.
How to group pandas DataFrame entries by date in a non-unique column

Here it is for smaller data, using datetime:
import pandas as pd
from datetime import datetime

df = pd.DataFrame()
df['Date'] = ['12/31/2004', '1/31/2005', '12/31/2005', '2/28/2006', '2/28/2007']
df['DAILY_RETURN'] = [-8, 9, 5, 10, 14]

# keep only the positive returns
df = df[df.DAILY_RETURN > 0]

# parse each date string and keep just the month number
df['Date_obj'] = df['Date'].apply(lambda x: datetime.strptime(x, '%m/%d/%Y').month)

# count positive returns per month
df.groupby('Date_obj').count()[['DAILY_RETURN']]
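
To apply the same idea to the multi-level index in the question, here is a minimal sketch, assuming the Date level of df3's MultiIndex has already been converted to datetimes (e.g. with pd.to_datetime). It groups by stock and month and filters on the share of positive returns:

# month number taken from the Date level of the MultiIndex
months = df3.index.get_level_values(1).month
stocks = df3.index.get_level_values(0)

# share of positive monthly returns per (stock, month) pair;
# the mean of a boolean Series is the fraction of True values
pos_share = (df3['DAILY_RETURN'] > 0).groupby([stocks, months]).mean()

# keep only the (stock, month) pairs with at least 70% positive returns
pos_share[pos_share >= 0.7]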


Pandas reindex Dates To Subset of Dates from List

I'm sorry, but I've gone through the online documentation and examples and I'm still not understanding. I have a pandas df with an index of dates in datetime format (yyyy-mm-dd), and I'm trying to resample or reindex this dataframe based on a subset of dates in the same format (yyyy-mm-dd) that are in a list. I have converted the df.index values to datetime using:
dfmla.index = pd.to_datetime(dfmla.index)
I've tried various things and I keep getting NaNs after applying the reindex. I know this must be a datatypes problem, and my df is in the form of:
df.dtypes
Out[30]:
month int64
mean_mon_flow float64
std_mon_flow float64
monthly_flow_ln float64
std_anomaly float64
dtype: object
My data looks like this:
df.head(5)
Out[31]:
month mean_mon_flow std_mon_flow monthly_flow_ln std_anomaly
date
1949-10-01 10 8.565828 0.216126 8.848631 1.308506
1949-11-01 11 8.598055 0.260254 8.368006 -0.883938
1949-12-01 12 8.612080 0.301156 8.384662 -0.755149
1950-08-01 8 8.614236 0.310865 8.173776 -1.416887
1950-09-01 9 8.663943 0.351730 8.437089 -0.644967
My month_list (list datatype) looks like this:
month_list[0:2]
Out[37]: ['1950-08-01', '1950-09-01']
I need my condensed, new reindexed df to look like this:
month mean_mon_flow std_mon_flow monthly_flow_ln std_anomaly
date
1950-08-01 8 8.614236 0.310865 8.173776 -1.416887
1950-09-01 9 8.663943 0.351730 8.437089 -0.644967
Thank you for your suggestions.
The NaNs appear because month_list holds strings while the index holds datetimes, so none of the labels match. If you're certain that all dates in month_list are in the index, you can do df.loc[month_list]; otherwise, use reindex with the labels converted to datetimes first:
df.reindex(pd.to_datetime(month_list))
Output:
month mean_mon_flow std_mon_flow monthly_flow_ln std_anomaly
date
1950-08-01 8 8.614236 0.310865 8.173776 -1.416887
1950-09-01 9 8.663943 0.351730 8.437089 -0.644967
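
For reference, a minimal reproduction of why the plain list gives NaNs (values invented for illustration):

import pandas as pd

df = pd.DataFrame(
    {'month': [10, 11, 8, 9]},
    index=pd.to_datetime(['1949-10-01', '1949-11-01', '1950-08-01', '1950-09-01']),
)
month_list = ['1950-08-01', '1950-09-01']

# string labels don't match datetime values, so every row comes back NaN
print(df.reindex(month_list))

# converting the labels first lines the dtypes up
print(df.reindex(pd.to_datetime(month_list)))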

Groupby two columns one of them is datetime

I have a data frame that I want to group by two columns, one of which is datetime type. How can I do this?
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': np.random.randn(6),
    'b': np.random.choice([5, 7, np.nan], 6),
    'c': np.random.choice(['panda', 'python', 'shark'], 6),
    # some ways to create systematic groups for indexing or groupby
    # this is similar to r's expand.grid()
    'd': np.repeat(range(3), 2),
    'e': np.tile(range(2), 3),
    # a date range and a set of random dates
    'f': pd.date_range('1/1/2011', periods=6, freq='D'),
    'g': np.random.choice(pd.date_range('1/1/2011', periods=365, freq='D'),
                          6, replace=False),
})
You can use pd.Grouper to specify groupby instructions. It can be used with a pd.DatetimeIndex to group the data at a specified frequency via the freq parameter.
Assuming that you have this dataframe:
df = pd.DataFrame(dict(
    a=dict(date=pd.Timestamp('2020-05-01'), category='a', value=1),
    b=dict(date=pd.Timestamp('2020-06-01'), category='a', value=2),
    c=dict(date=pd.Timestamp('2020-06-01'), category='b', value=6),
    d=dict(date=pd.Timestamp('2020-07-01'), category='a', value=1),
    e=dict(date=pd.Timestamp('2020-07-27'), category='a', value=3),
)).T
You can set the index to the date column; it will be converted to a pd.DatetimeIndex. Then you can use pd.Grouper along with other columns. The following example uses the category column.
The freq='M' parameter groups the index at month-end frequency. There are a number of offset aliases that can be used in pd.Grouper.
df.set_index('date').groupby([pd.Grouper(freq='M'), 'category'])['value'].sum()
Result:
date category
2020-05-31 a 1
2020-06-30 a 2
b 6
2020-07-31 a 4
Name: value, dtype: int64
Another example with your mcve:
df.set_index('g').groupby([pd.Grouper(freq='M'), 'c']).d.sum()
Result:
g c
2011-01-31 panda 0
2011-04-30 shark 2
2011-06-30 panda 2
2011-07-31 panda 0
2011-09-30 panda 1
2011-12-31 python 1
Name: d, dtype: int32
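
As a side note (not part of the original answer), pd.Grouper also accepts a key parameter, so you can group on a datetime column without setting it as the index first. A minimal sketch using the first example's dataframe (the one with date/category/value columns):

# same monthly sum as above, grouping on the 'date' column directly;
# the column must have a datetime dtype for freq to apply
df['date'] = pd.to_datetime(df['date'])
df.groupby([pd.Grouper(key='date', freq='M'), 'category'])['value'].sum()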

How to move the timestamp bounds for datetime in pandas (working with historical data)?

I'm working with historical data, and have some very old dates that are outside the timestamp bounds for pandas. I've consulted the Pandas Time series/date functionality documentation, which has some information on out of bounds spans, but from this information, it still wasn't clear to me what, if anything I could do to convert my data into a datetime type.
I've also seen a few threads on Stack Overflow about this, but they either just point out the problem (i.e. nanoseconds, max range 570-something years) or suggest setting errors='coerce', which turns 80% of my data into NaTs.
Is it possible to turn dates lower than the default Pandas lower bound into dates? Here's a sample of my data:
import pandas as pd

df = pd.DataFrame({'id': ['836', '655', '508', '793', '970', '1075', '1119', '969', '1166', '893'],
                   'date': ['1671-11-25', '1669-11-22', '1666-05-15', '1673-01-18', '1675-05-07',
                            '1677-02-08', '1678-02-08', '1675-02-15', '1678-11-28', '1673-12-23']})
You can create day periods with a lambda function:
df['date'] = df['date'].apply(lambda x: pd.Period(x, freq='D'))
Or, as @Erfan mentioned in a comment (thank you):
df['date'] = df['date'].apply(pd.Period)
print (df)
id date
0 836 1671-11-25
1 655 1669-11-22
2 508 1666-05-15
3 793 1673-01-18
4 970 1675-05-07
5 1075 1677-02-08
6 1119 1678-02-08
7 969 1675-02-15
8 1166 1678-11-28
9 893 1673-12-23
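
A vectorized alternative (my addition, not from the answers above) is pd.PeriodIndex, which converts the whole column in one pass and avoids the row-by-row apply:

# Period supports dates outside the Timestamp nanosecond bounds,
# and PeriodIndex parses the entire column at once
df['date'] = pd.PeriodIndex(df['date'], freq='D')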

monthly frequency time series data frame, fill NaNs with specific values

How do I assign values to the months from April to September?
I would like the April value to equal 42000, May=41000, June=61200, July=71000, August=71000.
df.index
RangeIndex(start=0, stop=60, step=1)
For a mapping like this, you would typically define a dictionary and map the values. Use .str.split to get the month part of the date and fillna to fill only the missing values.
Data:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Date': ['2018-Jan', '2018-Feb', '2018-Mar', '2018-Apr', '2018-May',
                            '2018-Jun', '2018-Jul', '2018-Aug', '2018-Sep'],
                   'Value': [75267.169, 42258.868, 43793] + [np.NaN] * 6})
Code:
d = {'Apr': 42000, 'May': 41000, 'Jun': 61200, 'Jul': 71000, 'Aug': 71000}
df['Value'] = df.Value.fillna(df.Date.str.split('-').str[1].map(d))
Output:
Date Value
0 2018-Jan 75267.169
1 2018-Feb 42258.868
2 2018-Mar 43793.000
3 2018-Apr 42000.000
4 2018-May 41000.000
5 2018-Jun 61200.000
6 2018-Jul 71000.000
7 2018-Aug 71000.000
8 2018-Sep NaN
A super simple (and ugly) way to do it using pd.DataFrame.iloc:
# fill the five target rows by position
to_fill = [42000, 41000, 61200, 71000, 71000]
df.iloc[54:59, 1] = to_fill
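
If the Date strings might vary, a slightly more robust variant of the mapping approach (my sketch, not from the answers above) parses them into real dates before mapping:

# parse '2018-Jan'-style strings, then map on the abbreviated month name
months = pd.to_datetime(df['Date'], format='%Y-%b').dt.strftime('%b')
df['Value'] = df['Value'].fillna(months.map(d))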

Pandas groupby on one column and then filter based on quantile value of another column

I am trying to filter my data down to only those rows in the bottom decile of the data for any given date. Thus, I need to group by the date first to get the sub-universe of data, and then from there filter that same sub-universe down to only those values falling in the bottom decile. I then need to aggregate all of the different dates back together to make one large dataframe.
For example, I want to take the following df:
df = pd.DataFrame([['2017-01-01', 1], ['2017-01-01', 5], ['2017-01-01', 10],
                   ['2018-01-01', 5], ['2018-01-01', 10]],
                  columns=['date', 'value'])
and keep only those rows where the value is in the bottom decile for that date (below 1.8 and 5.5, respectively):
date value
0 '2017-01-01' 1
1 '2018-01-01' 5
I can get a series of the bottom-decile cutoffs using df.groupby('date')['value'].quantile(.1), but this would then require me to iterate through the entire df and compare each value to the quantile value in the series, which I'm trying to avoid due to performance issues.
Something like this?
df.groupby('date').value.apply(lambda x: x[x < x.quantile(.1)]).reset_index(1, drop=True).reset_index()
date value
0 2017-01-01 1
1 2018-01-01 5
Edit: or, faster, use transform to broadcast each date's quantile cutoff back to the original rows and filter with a single boolean mask:
df.loc[df['value'] < df.groupby('date').value.transform(lambda x: x.quantile(.1))]
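
For completeness, a quick check of the transform version against the sample frame from the question:

import pandas as pd

df = pd.DataFrame([['2017-01-01', 1], ['2017-01-01', 5], ['2017-01-01', 10],
                   ['2018-01-01', 5], ['2018-01-01', 10]],
                  columns=['date', 'value'])

# broadcast each date's 10th-percentile cutoff back to its rows
cutoff = df.groupby('date')['value'].transform(lambda x: x.quantile(.1))
print(df.loc[df['value'] < cutoff])
#          date  value
# 0  2017-01-01      1
# 3  2018-01-01      5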