I have a file of data consisting of dates in column one and a series of measurements in columns 2 through n. I like that pandas understands dates, but I can't figure out how to do a simple best-fit line. Using np.polyfit is easy, but it doesn't understand dates. A sample of my attempt follows.
from datetime import datetime
from StringIO import StringIO
import pandas as pd
zdata = '2013-01-01, 5.00, 100.0 \n 2013-01-02, 7.05, 98.2 \n 2013-01-03, 8.90, 128.0 \n 2013-01-04, 11.11, 127.2 \n 2013-01-05, 13.08, 140.0'
unames = ['date', 'm1', 'm2']
df = pd.read_table(StringIO(zdata), sep="[ ,]*", header=None, names=unames,
                   parse_dates=True, index_col=0)
Y = pd.Series(df['m1'])
model = pd.ols(y=Y, x=df, intercept=True)
In [232]: model.beta['m1']
Out[232]: 0.99999999999999822
In [233]: model.beta['intercept']
Out[233]: -7.1054273576010019e-15
How do I interpret those numbers? If I use 1, 2, ..., 5 instead of dates, np.polyfit gives [ 2.024  2.958],
which are the slope and intercept I expect.
I looked for simple examples but didn't find any.
I believe you're doing multiple linear regression with the code you provided:
-------------------------Summary of Regression Analysis-------------------------
Formula: Y ~ <m1> + <m2> + <intercept>
Number of Observations: 5
Number of Degrees of Freedom: 3
R-squared: 1.0000
Adj R-squared: 1.0000
Rmse: 0.0000
F-stat (2, 2): inf, p-value: 0.0000
Degrees of Freedom: model 2, resid 2
-----------------------Summary of Estimated Coefficients------------------------
Variable Coef Std Err t-stat p-value CI 2.5% CI 97.5%
--------------------------------------------------------------------------------
m1 1.0000 0.0000 271549416425785.53 0.0000 1.0000 1.0000
m2 -0.0000 0.0000 -0.09 0.9382 -0.0000 0.0000
intercept -0.0000 0.0000 -0.02 0.9865 -0.0000 0.0000
---------------------------------End of Summary---------------------------------
Note the formula for the regression: Y ~ <m1> + <m2> + <intercept>. If you want a simple linear regression for m1 and m2 separately (each against time), then you should create an X:
X = pd.Series(range(1, len(df) + 1), index=df.index)
And make the regression:
model = pd.ols(y=Y, x=X, intercept=True)
Result:
-------------------------Summary of Regression Analysis-------------------------
Formula: Y ~ <x> + <intercept>
Number of Observations: 5
Number of Degrees of Freedom: 2
R-squared: 0.9995
Adj R-squared: 0.9993
Rmse: 0.0861
F-stat (1, 3): 5515.0414, p-value: 0.0000
Degrees of Freedom: model 1, resid 3
-----------------------Summary of Estimated Coefficients------------------------
Variable Coef Std Err t-stat p-value CI 2.5% CI 97.5%
--------------------------------------------------------------------------------
x 2.0220 0.0272 74.26 0.0000 1.9686 2.0754
intercept 2.9620 0.0903 32.80 0.0001 2.7850 3.1390
---------------------------------End of Summary---------------------------------
It's a bit weird that you got slightly different numbers when using np.polyfit. Here's my output:
[ 2.022 2.962]
Which is the same as pandas' ols output. I checked this with scipy's linregress and got the same result.
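As a side note for newer pandas releases, where pd.ols is no longer available: if you want the fit against the actual dates rather than a 1..n counter, a minimal sketch (assuming the df built from your snippet, with its DatetimeIndex) is to convert the dates to ordinal day numbers and hand those plain floats to np.polyfit:
import numpy as np
import pandas as pd

# np.polyfit needs plain numbers, so turn the DatetimeIndex into ordinal days.
x = df.index.map(pd.Timestamp.toordinal).astype(float)
slope, intercept = np.polyfit(x, df['m1'].values, 1)
# slope is "units of m1 per day"; the intercept refers to ordinal day 0,
# so it is only meaningful relative to that origin.
print(slope, intercept)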
I want to make a beta calculation in my dataframe, where beta = Σ(daily returns - mean daily return) * (daily market returns - mean market return) / Σ (daily market returns - mean market return)**2
But I want my beta calculation to apply to specific firms. In my dataframe, each firm has an ID code number (specified in column 1), and I want each ID code to be associated with its own beta.
I tried groupby, loc and a for loop, but it always returns an error since the beta calculation is quite long and requires many parentheses when inserted.
Any idea how to solve this problem? Thank you!
Dataframe:
index  ID  price  daily_return  mean_daily_return_per_ID  daily_market_return  mean_daily_market_return  date
0      1   27.50  0.008         0.0085                     0.0023               0.03345                    01-12-2012
1      2   33.75  0.0745        0.0745                     0.00458              0.0895                     06-12-2012
2      3   29.20  0.00006       0.00006                    0.0582               0.0045                     01-05-2013
3      4   20.54  0.00486       0.005125                   0.0009               0.0006                     27-11-2013
4      1   21.50  0.009         0.0085                     0.0846               0.04345                    04-05-2014
5      4   22.75  0.00539       0.005125                   0.0003               0.0006
I assume the intended form of your equation is
beta = Σ[(daily_return - mean_daily_return) * (daily_market_return - mean_daily_market_return)] / Σ[(daily_market_return - mean_daily_market_return)**2]
i.e. the sum in the numerator runs over the whole product. Then the following should compute the beta value for each group identified by ID.
Method 1: Creating our own function to output beta
import pandas as pd
import numpy as np

# beta_data.csv is a csv version of the sample data frame you provided.
df = pd.read_csv("./beta_data.csv")

def beta(daily_return, daily_market_return):
    """
    Returns the beta calculation for two pandas columns of equal length.
    Will return NaN for columns that have just one row each. Adjust
    this function to account for groups that have only a single value.
    """
    mean_daily_return = np.sum(daily_return) / len(daily_return)
    mean_daily_market_return = np.sum(daily_market_return) / len(daily_market_return)
    num = np.sum(
        (daily_return - mean_daily_return)
        * (daily_market_return - mean_daily_market_return)
    )
    denom = np.sum((daily_market_return - mean_daily_market_return) ** 2)
    return num / denom

# groupby the column ID, then 'apply' the function we created above
# to the two desired columns of each group
betas = df.groupby("ID")[["daily_return", "daily_market_return"]].apply(
    lambda x: beta(x["daily_return"], x["daily_market_return"])
)
print(f"betas: {betas}")
Method 2: Using pandas' built-in statistical functions
Notice that the beta above is just the covariance of DR and
DMR divided by the variance of DMR. Therefore we can write the above
program much more concisely as follows.
import pandas as pd

df = pd.read_csv("./beta_data.csv")

def beta(dr, dmr):
    """
    dr: daily_return (pandas column)
    dmr: daily_market_return (pandas column)
    TODO: handle the divide-by-zero errors etc.
    """
    num = dr.cov(dmr)
    denom = dmr.var()
    return num / denom

betas = df.groupby("ID")[["daily_return", "daily_market_return"]].apply(
    lambda x: beta(x["daily_return"], x["daily_market_return"])
)
print(f"betas: {betas}")
The output in both cases is:
ID
1 0.012151
2 NaN
3 NaN
4 -0.883333
dtype: float64
The reason for getting NaNs for IDs 2 and 3 is that they only have a single row each. You should modify the beta function to accommodate these corner cases.
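As a rough sketch of such a modification (reusing Method 2's approach, nothing here beyond what the question already defines), you can return NaN explicitly whenever a group is too small or the market variance is zero:
import numpy as np

def beta_safe(dr, dmr):
    # A beta needs at least two observations; with a single row the
    # variance of the market return is zero and the ratio is undefined.
    if len(dr) < 2:
        return np.nan
    denom = dmr.var()
    return dr.cov(dmr) / denom if denom != 0 else np.nan

betas = df.groupby("ID")[["daily_return", "daily_market_return"]].apply(
    lambda x: beta_safe(x["daily_return"], x["daily_market_return"])
)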
Maybe you can start like this?
id_list = list(set(df["ID"].values.tolist()))
for firm_id in id_list:
    new_df = df.loc[df["ID"] == firm_id]
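If you prefer to stay with that loop, a hedged sketch of one way to finish it (using only the column names from your frame) is to compute the beta inside the loop and collect the results in a dict keyed by ID:
betas = {}
for firm_id in id_list:
    sub = df.loc[df["ID"] == firm_id]
    # Deviations from the per-firm means, then the ratio of sums from your formula.
    dr = sub["daily_return"] - sub["daily_return"].mean()
    dmr = sub["daily_market_return"] - sub["daily_market_return"].mean()
    denom = (dmr ** 2).sum()
    betas[firm_id] = (dr * dmr).sum() / denom if denom != 0 else float("nan")
print(betas)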
I have a df containing columns "Income_group", "Rate", and "Probability". I need to randomly select a rate for each income group. How can I write a loop and print out the result for each income bin?
The pandas data frame table looks like this:
import pandas as pd
df={'Income_Groups':['1','1','1','2','2','2','3','3','3'],
'Rate':[1.23,1.25,1.56, 2.11,2.32, 2.36,3.12,3.45,3.55],
'Probability':[0.25, 0.50, 0.25,0.50,0.25,0.25,0.10,0.70,0.20]}
df2=pd.DataFrame(data=df)
df2
Shooting in the dark here, but you can use np.random.choice:
import numpy as np

(df2.groupby('Income_Groups')
    .apply(lambda x: np.random.choice(x['Rate'], p=x['Probability']))
)
Output (can vary due to randomness):
Income_Groups
1 1.25
2 2.36
3 3.45
dtype: float64
You can also pass size into np.random.choice:
(df2.groupby('Income_Groups')
    .apply(lambda x: np.random.choice(x['Rate'], size=3, p=x['Probability']))
)
Output:
Income_Groups
1 [1.23, 1.25, 1.25]
2 [2.36, 2.11, 2.11]
3 [3.12, 3.12, 3.45]
dtype: object
You need GroupBy.apply here because of the per-group weights:
import numpy as np
(df2.groupby('Income_Groups')
    .apply(lambda gp: np.random.choice(a=gp.Rate, p=gp.Probability, size=1)[0]))
#Income_Groups
#1 1.23
#2 2.11
#3 3.45
#dtype: float64
Another silly way, because your weights seem to have precision to 2 decimal places:
s = df2.set_index(['Income_Groups', 'Probability']).Rate
(s.repeat((s.index.get_level_values('Probability') * 100).astype(int))  # Weight
 .sample(frac=1)                              # Shuffle |
 .reset_index()                               # + | -> Random Select
 .drop_duplicates(subset=['Income_Groups'])   # Select |
 .drop(columns='Probability'))
# Income_Groups Rate
#0 2 2.32
#1 1 1.25
#3 3 3.45
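Yet another option worth sketching: DataFrame.sample accepts a weights argument, so you can draw one weighted row per group directly (assuming, as in your example, that the probabilities within each group describe the relative chance of each rate):
(df2.groupby('Income_Groups', group_keys=False)
    .apply(lambda g: g.sample(n=1, weights=g['Probability']))
    .set_index('Income_Groups')['Rate'])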
I have the following code which computes some aggregations for my data frame:
def percentile(n):
    def percentile_(x):
        return np.percentile(x, n)
    percentile_.__name__ = 'percentile_%s' % n
    return percentile_

df_type = df[['myType', 'required_time']].groupby(['myType']).agg(
    ['count', 'min', 'max', 'median', 'mean', 'std', percentile(25), percentile(75)])
The code works fine. However, I now want to compute the mean and std using only the data between the 25th and 75th percentiles. What would be the most elegant way to achieve this in pandas? Thanks!
You can try using quantile and describe; see if this works for you:
df[['myType', 'required_time']].groupby(['myType']).quantile([0.25,0.5]).describe()
Out:
RandomForestClassifier AdaBoostClassifier GaussianNB
count 2.000000 2.000000 2.000000
mean 0.596761 0.627393 0.580476
std 0.496570 0.463766 0.491389
min 0.245632 0.299462 0.233012
25% 0.421196 0.463427 0.406744
50% 0.596761 0.627393 0.580476
75% 0.772325 0.791359 0.754208
max 0.947889 0.955325 0.927941
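Note that quantile().describe() summarizes the computed quantiles themselves. If instead you want the mean and std of only the values lying between the 25th and 75th percentiles within each group, a hedged sketch (reusing the df, myType and required_time names from your snippet) is:
import numpy as np
import pandas as pd

def iqr_stats(x):
    # Keep only the values between the 25th and 75th percentile, then summarize.
    lo, hi = np.percentile(x, [25, 75])
    trimmed = x[(x >= lo) & (x <= hi)]
    return pd.Series({'iqr_mean': trimmed.mean(), 'iqr_std': trimmed.std()})

df.groupby('myType')['required_time'].apply(iqr_stats).unstack()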
I have a question on how to do this task. I want to group a series of numbers in my data frame, from the column 'PD', which ranges from .001 to 1, into a column named 'Grouping'. What I want is to map 0.9 < PD < 0.91 to 0.91 (i.e., return a value of 0.91), 0.91 <= PD < 0.92 to 0.92, ..., and 0.99 <= PD <= 1 to 1. What I have been doing is manually writing each if statement and then merging the result with the base data frame. Can anyone help me with a more efficient way of doing this? I'm still in the early stages of using Python, so sorry if the question seems easy. Thank you for answering and for your time.
Say your data looks like this:
>>> df = pd.DataFrame({'PD': np.arange(0.001, 1, 0.001), 'data': np.random.randint(10, size=999)})
>>> df.head()
PD data
0 0.001 6
1 0.002 3
2 0.003 5
3 0.004 9
4 0.005 7
Then cut off the last decimal of the PD column. This is a bit tricky, since you run into rounding issues when doing it without a str conversion. E.g.
>>> df['PD'] = df['PD'].apply(lambda x: float('{:.3f}'.format(x)[:-1]))
>>> df.tail()
PD data
994 0.99 1
995 0.99 3
996 0.99 2
997 0.99 1
998 0.99 0
Now you can use the pandas groupby. Do with the data whatever you want, e.g.
>>> df.groupby('PD').agg(lambda x: ','.join(map(str, x)))
data
PD
0.00 6,3,5,9,7,3,6,8,4
0.01 3,5,7,0,4,9,7,1,7,1
0.02 0,0,9,1,5,4,1,6,7,3
0.03 4,4,6,4,6,5,4,4,2,1
0.04 8,3,1,4,6,5,0,6,0,5
[...]
Note that the first group is one item shorter due to the missing 0.000 in my sample.
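A hedged alternative to the string round-trip, if what you want is to label each PD with the upper edge of its 0.01-wide bin as described in the question, is pd.cut (this assumes df is your frame with the PD column; how an exact boundary value like 0.91 is assigned depends on the right= argument, since the question's inequalities are not fully consistent):
import numpy as np
import pandas as pd

# Bin edges at every 0.01 from 0 to 1; each PD gets the upper edge of its bin.
edges = np.round(np.arange(0.0, 1.0001, 0.01), 2)
df['Grouping'] = pd.cut(df['PD'], bins=edges, labels=edges[1:], right=True).astype(float)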
The pandas describe() function generates descriptive statistics that summarize the dataset, excluding NaN values. But does the exclusion here mean that the total count (i.e., the number of rows of a variable) varies, or is it fixed?
For example, I calculate the mean by using describe() for a df with missing values:
varA
1
1
1
1
NaN
Is the mean = 4/5 or 4/4 here?
And how does it apply to other results in describe? For example, the standard deviation, quartiles?
Thanks!
As ayhan pointed out, in the current 0.21 release NaN values are excluded from all summary statistics provided by pandas.DataFrame.describe().
With NaN:
import numpy as np
import pandas as pd

data_with_nan = list(range(20)) + [np.NaN] * 20
df = pd.DataFrame(data=data_with_nan, columns=['col1'])
df.describe()
col1
count 20.00000
mean 9.50000
std 5.91608
min 0.00000
25% 4.75000
50% 9.50000
75% 14.25000
max 19.00000
Without:
data_without_nan = list(range(20))
df = pd.DataFrame(data=data_without_nan, columns=['col1'])
df.describe()
col1
count 20.00000
mean 9.50000
std 5.91608
min 0.00000
25% 4.75000
50% 9.50000
75% 14.25000
max 19.00000
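Applied to the varA example from the question, a tiny check (nothing beyond pandas defaults) confirms that the denominator is the number of non-missing values, so the mean is 4/4 = 1.0 rather than 4/5; the std and quartiles reported by describe() likewise ignore the NaN row:
import numpy as np
import pandas as pd

s = pd.Series([1, 1, 1, 1, np.nan], name='varA')
print(s.mean())               # 1.0 -> NaN is dropped from both the sum and the count
print(s.describe()['count'])  # 4.0 -> only the non-NaN rows are counted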