Pandas: compute mean and std within the [25th percentile, 75th percentile] range

I have the following code which computes some aggregations for my data frame:
import numpy as np

def percentile(n):
    def percentile_(x):
        return np.percentile(x, n)
    percentile_.__name__ = 'percentile_%s' % n
    return percentile_

df_type = df[['myType', 'required_time']].groupby(['myType']).agg(
    ['count', 'min', 'max', 'median', 'mean', 'std', percentile(25), percentile(75)])
The code works fine. However, now I want to compute the mean and std using only the data within the [25th percentile, 75th percentile] range. What would be the most elegant way to achieve this in Pandas? Thanks!

You can try using quantile and describe; see if this works for you:
df[['myType', 'required_time']].groupby(['myType']).quantile([0.25,0.5]).describe()
Out:
RandomForestClassifier AdaBoostClassifier GaussianNB
count 2.000000 2.000000 2.000000
mean 0.596761 0.627393 0.580476
std 0.496570 0.463766 0.491389
min 0.245632 0.299462 0.233012
25% 0.421196 0.463427 0.406744
50% 0.596761 0.627393 0.580476
75% 0.772325 0.791359 0.754208
max 0.947889 0.955325 0.927941
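Note that the describe() output above is computed over the quantile values themselves, not over the raw data between them. If the goal is literally the mean and std of the values lying inside each group's interquartile range, a minimal sketch (assuming the same df and column names as in the question) could be:
import numpy as np

def iqr_mean(x):
    lo, hi = np.percentile(x, [25, 75])      # group-specific 25th/75th percentiles
    return x[(x >= lo) & (x <= hi)].mean()   # mean of the values inside that range

def iqr_std(x):
    lo, hi = np.percentile(x, [25, 75])
    return x[(x >= lo) & (x <= hi)].std()

df.groupby('myType')['required_time'].agg([iqr_mean, iqr_std])
These functions can also be appended to the existing .agg list alongside 'count', 'min', and so on.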

Related

Pandas Groupby Weighted Standard Deviation

I have a dataframe:
   Type  Weights  Value  ....
0  W     0.5      15
1  C     1.2      19
2  W     12       25
3  C     7.1      15
...
I want to group on type and then calculate weighted mean and weighted standard deviation.
There seems to be a solution available for the weighted mean (see "groupby weighted average and sum in pandas dataframe") but none for the weighted standard deviation.
Is there a simple way to do it?
I have used the weighted standard deviation formula from the following link:
https://doc-archives.microstrategy.com/producthelp/10.7/FunctionsRef/Content/FuncRef/WeightedStDev__weighted_standard_deviation_of_a_sa.htm
However, you can modify it for a different formula.
import numpy as np

def weighted_sd(input_df):
    weights = input_df['Weights']
    vals = input_df['Value']
    numer = np.sum(weights * (vals - vals.mean())**2)
    denom = ((vals.count() - 1) / vals.count()) * np.sum(weights)
    return np.sqrt(numer / denom)

print(df.groupby('Type').apply(weighted_sd))
Minor correction to the weighted standard deviation formula from the previous answer: the deviations should be taken from the weighted average, not the unweighted mean.
import numpy as np

def weighted_sd(input_df):
    weights = input_df['Weights']
    vals = input_df['Value']
    weighted_avg = np.average(vals, weights=weights)
    numer = np.sum(weights * (vals - weighted_avg)**2)
    denom = ((vals.count() - 1) / vals.count()) * np.sum(weights)
    return np.sqrt(numer / denom)

print(df.groupby('Type').apply(weighted_sd))
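To get both the weighted mean and the weighted standard deviation per group in one pass, a possible sketch (same column names and the same bias correction as the corrected function above; weighted_stats and the output column names are just illustrative):
import numpy as np
import pandas as pd

def weighted_stats(g):
    w, v = g['Weights'], g['Value']
    mean = np.average(v, weights=w)                    # weighted mean
    denom = ((v.count() - 1) / v.count()) * np.sum(w)  # same correction factor as above
    std = np.sqrt(np.sum(w * (v - mean) ** 2) / denom)
    return pd.Series({'w_mean': mean, 'w_std': std})

print(df.groupby('Type').apply(weighted_stats))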

How to perform calculations on pandas groupby dataframes

I have a dataframe which, for the sake of showing a minimal example, I have simplified to this:
df = pd.DataFrame({'pnl':[+10,+23,+15,-5,+20],'style':['obs','obs','obs','bf','bf']})
I would like to do the following:
Group the dataframe by style
Count the positive entries of pnl and divide by the total number of entries of that same style.
For example, style 'bf' has 2 entries, one positive and one negative, so 1/2 (total) = 0.5.
This should yield the following result:
style win_rate
bf 0.5
obs 1
dtype: float64
I thought of getting a list of the groups, iterating over them, and building a new df... but that seems like an antipattern to me. I am pretty sure there is an easier / more pythonic solution.
Thanks.
You can group the boolean series df['pnl'].gt(0) by df['style'] and take the mean:
In [14]: df['pnl'].gt(0).groupby(df['style']).mean()
Out[14]:
style
bf 0.5
obs 1.0
Name: pnl, dtype: float64
You can try pd.crosstab, which uses groupby in the background, to get the fraction of both positive and non-positive entries:
pd.crosstab(df['style'], df['pnl'].gt(0), normalize='index')
Output:
pnl False True
style
bf 0.5 0.5
obs 0.0 1.0
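One more spelling of the same idea, using named aggregation (available in newer pandas versions), which returns a column literally called win_rate as in the desired output; the intermediate win column name is just a placeholder:
out = (df.assign(win=df['pnl'].gt(0))
         .groupby('style', as_index=False)
         .agg(win_rate=('win', 'mean')))
print(out)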

How do I compute the ratio of two columns using the query function

Because I find myself typing the following pattern frequently in Pandas,
(dataframe['colA'] / dataframe['colB']).describe()
I am trying to do this using the more succinct query function.
dataframe.query("colA / colB").describe()
Unfortunately, the usage above does not work.
Any suggestions to get it working?
You can't use query for this.
On the other hand, you could use eval:
In [63]: df = pd.DataFrame({"colA": [1,2,3], "colB": [3,4,5]})
In [64]: df.eval("colA / colB")
Out[64]:
0 0.333333
1 0.500000
2 0.600000
dtype: float64
In [65]: df.eval("colA / colB").describe()
Out[65]:
count 3.000000
mean 0.477778
std 0.134715
min 0.333333
25% 0.416667
50% 0.500000
75% 0.550000
max 0.600000
dtype: float64
but honestly I don't think this pattern is as convenient as you may think it's going to be. YMMV, of course.
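If you want to stay in plain pandas without eval, an assign-based chain does the same thing (the ratio column name here is just illustrative):
df.assign(ratio=df['colA'] / df['colB'])['ratio'].describe()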

Does pandas.describe() exclude missing-value rows?

The pandas describe() function generates descriptive statistics that summarize the dataset, excluding NaN values. But does the exclusion here mean that the total count (i.e., the number of rows of a variable) varies, or is it fixed?
For example, I calculate the mean by using describe() for a df with missing values:
varA
1
1
1
1
NaN
Is the mean = 4/5 or 4/4 here?
And how does it apply to other results in describe? For example, the standard deviation, quartiles?
Thanks!
As ayhan pointed out, in the current 0.21 release NaN values are excluded from all summary statistics provided by pandas.DataFrame.describe().
With NaN:
data_with_nan = list(range(20)) + [np.NaN]*20
df = pd.DataFrame(data=data_with_nan, columns=['col1'])
df.describe()
col1
count 20.00000
mean 9.50000
std 5.91608
min 0.00000
25% 4.75000
50% 9.50000
75% 14.25000
max 19.00000
Without:
data_without_nan = list(range(20))
df = pd.DataFrame(data=data_without_nan, columns=['col1'])
df.describe()
col1
count 20.00000
mean 9.50000
std 5.91608
min 0.00000
25% 4.75000
50% 9.50000
75% 14.25000
max 19.00000
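Applied to the varA example from the question, a quick check (a sketch; the series just reproduces the five values shown) confirms that the mean is 4/4, not 4/5:
import numpy as np
import pandas as pd

s = pd.Series([1, 1, 1, 1, np.nan], name='varA')
print(s.mean())      # 1.0, i.e. 4/4 -- the NaN row is dropped, not counted as a zero
print(s.describe())  # count is 4; std and the quartiles also use only the 4 non-NaN rows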

Why am I getting 0 for Series.prod()?

I have a series consisting of either positive numbers or NaN, but when I compute the product, I get 0.
Sample output:
In [14]: pricerelatives.mean()
Out[14]: 0.99110019490541013
In [15]: pricerelatives.prod()
Out[15]: 0.0
In [16]: len(pricerelatives)
Out[16]: 362698
In [17]: (pricerelatives>0).sum()
Out[17]: 223522
In [18]: (pricerelatives.isnull()).sum()
Out[18]: 139176
In [19]: 223522+139176
Out[19]: 362698
Why am I getting 0 for pricerelatives.prod()?
Update:
Thanks for the quick response. Unfortunately, it did not work:
In [32]: import operator
In [33]: from functools import reduce
In [34]: lst = list(pricerelatives.fillna(1))
In [35]: the_prod = reduce(operator.mul, lst)
In [36]: the_prod
Out[36]: 0.0
Explicitly getting rid of nulls also fails:
In [37]: pricerelatives[pricerelatives.notnull()].prod()
Out[37]: 0.0
Update 2:
Indeed, that's exactly what I just did and was going to add.
In [39]: pricerelatives.describe()
Out[39]:
count 223522.000000
mean 0.991100
std 0.088478
min 0.116398
25% 1.000000
50% 1.000000
75% 1.000000
max 11.062591
dtype: float64
Update 3: This still seems strange to me, so here is more detailed information:
In [46]: pricerelatives[pricerelatives<1].describe()
Out[46]:
count 50160.000000
mean 0.922993
std 0.083865
min 0.116398
25% 0.894997
50% 0.951488
75% 0.982058
max 1.000000
dtype: float64
Update 4: The ratio is right around your example's cutoff between 0 and >0, but my numbers are much more tightly clustered around 1 than your uniform(0, 1) and uniform(1, 2) samples.
In [52]: 50160./223522
Out[52]: 0.2244074408783028
In [53]: pricerelatives[pricerelatives>=1].describe()
Out[53]:
count 173362.000000
mean 1.010806
std 0.079548
min 1.000000
25% 1.000000
50% 1.000000
75% 1.000000
max 11.062591
dtype: float64
In [54]: pricerelatives[pricerelatives<1].prod()
Out[54]: 0.0
This looks like a "bug" in numpy: it doesn't raise when integer multiplication overflows or when the floating-point product underflows to 0.
Here are some examples (these use bare prod, poisson and randn, i.e. the NumPy / numpy.random functions, as in an IPython --pylab session):
In [26]: prod(poisson(10, size=30))
Out[26]: -2043494819862020096
In [46]: prod(randn(10000))
Out[46]: 0.0
You'll have to use the long (Python 2) or int (Python 3) type and reduce it using reduce/functools.reduce:
import operator
from functools import reduce
lst = list(pricerelatives.dropna())
the_prod = reduce(operator.mul, lst)
EDIT: It's going to be faster to just remove all of the NaNs and then compute the product rather than setting them to 1 first.
Very informally, the reason you're still getting zero is that the product will approach zero faster as the ratio of the number of values in [0, 1) to values >= 1 grows.
import numpy as np
import pandas as pd

def nnz_ratio(ratio, size=1000):
    # a fraction `ratio` of the values drawn from [1, 2), the rest from [0, 1)
    n1 = int(ratio * size)
    n2 = size - n1
    s1 = np.random.uniform(1, 2, size=n1)
    s2 = np.random.uniform(0, 1, size=n2)
    return pd.Series(np.hstack((s1, s2)))

ratios = np.linspace(0.01, 1, 25)
ss = np.empty(len(ratios))
for i, ratio in enumerate(ratios):
    ss[i] = nnz_ratio(ratio).prod()
ss
gives:
array([ 0.0000e+000, 0.0000e+000, 0.0000e+000, 0.0000e+000,
0.0000e+000, 3.6846e-296, 2.6969e-280, 1.2799e-233,
2.0497e-237, 4.9666e-209, 6.5059e-181, 9.8479e-171,
7.7879e-125, 8.2696e-109, 9.3416e-087, 4.1574e-064,
3.9266e-036, 4.1065e+004, 6.6814e+018, 7.1501e+040,
6.2192e+070, 1.3523e+093, 1.0739e+110, 1.5646e+144,
8.6361e+163])
EDIT #2:
If you're computing the geometric mean, use
from scipy.stats import gmean
gm = gmean(pricerelatives.dropna())
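If you need the raw product rather than the geometric mean, one way to sidestep the silent underflow (a sketch, assuming the same pricerelatives series as above) is to sum logarithms and only exponentiate at the end:
import numpy as np

vals = pricerelatives.dropna()
log_sum = np.log(vals).sum()   # the sum of logs stays in a representable range
the_prod = np.exp(log_sum)     # may still be 0.0 or inf if the true product lies outside float range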