how is pandas kurtosis defined? - pandas

I am trying to get kurtosis using pandas. By doing some exploration, I have
test_series = pd.Series(np.random.randn(5000))
test_series.kurtosis()
however, the output is:
-0.006755982906479385
But I think the kurtosis (https://en.wikipedia.org/wiki/Kurtosis) should be close to the following (perhaps normalized over N-1 instead of N, but that does not matter here):
(test_series - test_series.mean()).pow(4).mean()/np.power(test_series.std(),4)
which is
2.9908543104146026

The pandas documentation says the following
Return unbiased kurtosis over requested axis using Fisher’s definition of kurtosis (kurtosis of normal == 0.0)
This is probably the excess kurtosis, defined as kurtosis - 3.

Pandas is calculating the unbiased estimator of the excess kurtosis. Kurtosis is the normalized 4th central moment. To find the unbiased estimators of the cumulants you need the k-statistics.
So the estimator of the excess kurtosis that pandas returns is k4/k2**2, where k4 and k2 are the unbiased estimators of the fourth and second cumulants.
To illustrate this:
import pandas as pd
import numpy as np
np.random.seed(11234)
test_series = pd.Series(np.random.randn(5000))
test_series.kurtosis()
#-0.0411811269445872
Now we can calculate this explicitly using the k-statistics:
n = len(test_series)
S1 = test_series.pow(1).sum()
S2 = test_series.pow(2).sum()
S3 = test_series.pow(3).sum()
S4 = test_series.pow(4).sum()
# Eq (7) and (5) from the k-statistics link
k4 = (-6*S1**4 + 12*n*S1**2*S2 - 3*n*(n-1)*S2**2 -4*n*(n+1)*S1*S3 + n**2*(n+1)*S4)/(n*(n-1)*(n-2)*(n-3))
k2 = (n*S2-S1**2)/(n*(n-1))
# k2 is the same as the N-1 variance: test_series.std(ddof=1)**2
k4/k2**2
#-0.04118112694458816
If you want better agreement to more decimal places, you'll need to be careful with the sums as they get rather large. But they're identical to 12 places.
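The same number can also be reached from the central moments, which avoids the large raw power sums. The following is a minimal sketch, assuming the usual bias-corrected sample excess kurtosis formula G2 = (n-1)/((n-2)(n-3)) * ((n+1)*g2 + 6), which is algebraically the same as k4/k2**2. The seed and the variable names are illustrative, and the comparison with scipy.stats.kurtosis(fisher=True, bias=False) is included only as a cross-check.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)           # illustrative seed, not from the question
x = pd.Series(rng.standard_normal(5000))
n = len(x)

# Biased (divide-by-n) central moments
m2 = ((x - x.mean()) ** 2).mean()
m4 = ((x - x.mean()) ** 4).mean()

g2 = m4 / m2 ** 2 - 3                    # biased excess kurtosis
G2 = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)   # bias-corrected version

pandas_kurt = x.kurtosis()
scipy_kurt = stats.kurtosis(x.to_numpy(), fisher=True, bias=False)
print(G2, pandas_kurt, scipy_kurt)
```

The quantity m4 / m2**2 is the plain kurtosis computed in the question (about 3 for normal data); subtracting 3 and applying the small-sample correction gives the value pandas reports.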

Related

How to detect multivariate outliers within large dataset?

How do I detect multivariate outliers within a large dataset with more than 50 variables? Do I need to plot all of the variables, do I have to group them into independent and dependent variables, or do I need an algorithm for this?
There is a distance measure designed for finding multivariate outliers: the Mahalanobis distance (MD).
The MD measures the separation between a data point x and a distribution D by generalizing the z-score: it indicates how far x is from the mean of D in terms of standard deviations.
You can use the function below to find outliers. It returns the positional indices of the outliers along with the distance of every row.
from scipy.stats import chi2
from scipy import linalg
import numpy as np

def mahalanobis_method(df):
    # Mahalanobis distance of every row from the column means
    x_minus_mean = (df - df.mean()).values
    cov = np.cov(df.values.T)       # covariance matrix
    inv_covmat = linalg.inv(cov)    # inverse covariance
    left_term = np.dot(x_minus_mean, inv_covmat)
    # Row-wise quadratic form; avoids building the full n x n matrix
    mahal = np.sum(left_term * x_minus_mean, axis=1)
    md = np.sqrt(mahal)
    # Cut-off point: chi-square quantile, degrees of freedom = number of variables
    C = np.sqrt(chi2.ppf(1 - 0.001, df=df.shape[1]))
    # Flag as outliers every row whose distance exceeds the cut-off
    outliers = [i for i, v in enumerate(md) if v > C]
    return outliers, md
If you want to study more about Mahalanobis Distance and its formula you can read this blog.
So, how to understand the above formula? Let’s take the (x – m)^T . C^(-1) term. (x – m) is essentially the distance of the vector from the mean. We then divide this by the covariance matrix (or multiply by the inverse of the covariance matrix). If you think about it, this is essentially a multivariate equivalent of the regular standardization (z = (x – mu)/sigma).
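To see the cut-off in action, here is a small self-contained sketch on synthetic data. The function name mahalanobis_outliers, the alpha parameter, and the injected outlier are illustrative assumptions and not part of the answer above. It compares the squared distance directly against the chi-square quantile, which is equivalent to comparing the distance against its square root.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2

def mahalanobis_outliers(df, alpha=0.001):
    """Return positions of rows whose squared Mahalanobis distance exceeds
    the chi-square quantile at 1 - alpha, plus the distances themselves."""
    centered = (df - df.mean()).to_numpy()
    inv_cov = np.linalg.inv(np.cov(df.to_numpy(), rowvar=False))
    # Row-wise quadratic form (x - m)^T C^-1 (x - m)
    d2 = np.einsum("ij,jk,ik->i", centered, inv_cov, centered)
    cutoff = chi2.ppf(1 - alpha, df=df.shape[1])
    return np.flatnonzero(d2 > cutoff), np.sqrt(d2)

# Synthetic data: 500 rows, 5 roughly standard-normal variables
rng = np.random.default_rng(42)
data = pd.DataFrame(rng.standard_normal((500, 5)), columns=list("abcde"))
data.iloc[0] = 10.0          # inject one obvious multivariate outlier

flagged, md = mahalanobis_outliers(data)
print(flagged, md[0])
```

Note that extreme points also inflate the estimated mean and covariance, so for heavily contaminated data a robust covariance estimate may be preferable.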

To find an inverse matrix of A with LU decomposition

The task asks me to generate a matrix A with 50 columns and 50 rows, using the random library with seed 1007092020, with values in the range [0,1].
import numpy as np
np.random.seed(1007092020)
A = np.random.randint(2, size=(3,3))
Then I have to find an inverse matrix of A with LU decomposition.
No idea how to do that.
If you need matrix A to be a 50 x 50 matrix with random floating numbers, then you can make that with the following code :
import numpy as np
np.random.seed(1007092020)
A = np.random.random((50,50))
Instead, if you want integers in the range 0,1 (1 included), you can do this
A = np.random.randint(0,2,(50,50))
If you want to compute the inverse using LU decomposition, you can use SciPy. It should be noted that since you are generating random matrices, it is possible that your matrix does not have an inverse. In that case, you can not find the inverse.
Here's some code that will work in case A does have an inverse.
from scipy.linalg import lu
p,l,u = lu(A, permute_l = False)
lu returns a permutation matrix p together with the lower (l) and upper (u) triangular matrices, such that A = p l u. Folding p into l gives A = (p l) u, so we can find the inverse of A by the following equation : A^-1 = u^-1 (p l)^-1
l = np.dot(p,l)
l_inv = np.linalg.inv(l)
u_inv = np.linalg.inv(u)
A_inv = np.dot(u_inv,l_inv)
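An alternative that avoids inverting the factors explicitly is to factor A once and then solve A X = I with the stored factors. The sketch below assumes the 50 x 50 floating-point matrix from above and uses scipy.linalg.lu_factor and scipy.linalg.lu_solve; the residual check at the end is only there for illustration.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

np.random.seed(1007092020)
A = np.random.random((50, 50))

# Factor once: lu_and_piv holds the combined L/U matrix and the pivot indices
lu_and_piv = lu_factor(A)

# Solving A X = I column by column with the stored factors yields X = A^-1
identity = np.eye(A.shape[0])
A_inv = lu_solve(lu_and_piv, identity)

# Largest deviation of A @ A_inv from the identity
residual = np.abs(A @ A_inv - identity).max()
print(residual)
```

If the matrix is singular or nearly singular, the residual will be large, which is a quick way to notice that no usable inverse exists.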

np.poly1d: how to calculate R^2

I am fitting my data to a linear regression. But I want to know how to calculate the R2 values. The following is the code I have so far.
total_csv= pd.read_csv('IgG1_sigma_biospin_neg.csv',header=0).iloc[:,:]
x_values=(19,20,21,22)
y_values=IgG1_sigma_biospin_neg.loc[0, ['19-', '20-', '21-', '22-']].tolist()
my_fitting= np.polyfit(x_values,y_values,1)
my_lin_fitting = np.poly1d(my_fitting)
my_x=Symbol('x')
print('my_equation:',expand(my_lin_fitting (my_x)))
I get the equation of the linear fitting of my data 35.6499591999999*x + 6018.6395529.
In [95]:y_values
Out[95]: [6698.0902240000005, 6733.253559000001, 6757.754712999999, 6808.75637]
Do you know how to calculate R2 values?
To the best of my knowledge, np.polyfit does not provide a coefficient of determination (R2).
The residual that Richard mentioned in his answer is something different, named Sum of Squares Error (SSE). More info about it here:
https://365datascience.com/tutorials/statistics-tutorials/sum-squares/
Good news is, you can easily calculate R2 from SSE. First you calculate the Sum of Square Total (SST), then the R2 is merely R2 = 1 - SSE / SST. (See above link for further explanations.)
import numpy as np
# generate pseudo-data so the code can be run standalone (nicer for a mwe)
x_values = np.arange(100)
y_values = 3 * x_values + 2 + np.random.random(100)-0.5
my_fitting = np.polyfit(x_values, y_values, 1, full=True)
coeff = my_fitting[0]
### Residual or Sum of Square Error (SSE)
SSE = my_fitting[1][0]
### Determining the Sum of Square Total (SST)
## the squared differences between the observed dependent variable and its mean
diff = y_values - y_values.mean()
square_diff = diff ** 2
SST = square_diff.sum()
### Now getting the coefficient of determination (R2)
R2 = 1 - SSE/SST
print(R2)
Another approach is to use the already implemented function provided by Scikit-learn.
## Alternative using sklearn
from sklearn.metrics import r2_score
predict = np.poly1d(coeff)
R2 = r2_score(y_values, predict(x_values))
print(R2)
Both methods give me the very same answer.
First, consider using the newer np.polynomial.polynomial module instead of np.polyfit (the np.polyfit documentation points people to the newer API).
You can then use the polyfit function there. By default it returns only the coefficients (lowest degree first). If you specify full=True, polyfit also returns a list of diagnostics whose first element is the sum of squared residuals (SSE). Note that this is not R2 itself; R2 follows as 1 - SSE/SST. See here.
The mod to your code above would be below:
import numpy.polynomial.polynomial as poly
my_fitting, stats = poly.polyfit(x_values, y_values, 1, full=True)
SSE = stats[0][0]  # sum of squared residuals, not R2
SST = np.sum((np.asarray(y_values) - np.mean(y_values)) ** 2)
R2 = 1 - SSE / SST
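For the four data points in the question, the calculation can be sketched as follows. The x and y values are copied from the question; everything else (variable names, computing SSE from the predictions rather than from full=True) is an illustrative choice.

```python
import numpy as np

# Data copied from the question
x_values = np.array([19, 20, 21, 22], dtype=float)
y_values = np.array([6698.0902240000005, 6733.253559000001,
                     6757.754712999999, 6808.75637])

# Degree-1 fit; np.polyfit returns the highest-degree coefficient first
slope, intercept = np.polyfit(x_values, y_values, 1)
predicted = slope * x_values + intercept

SSE = np.sum((y_values - predicted) ** 2)         # sum of squared errors
SST = np.sum((y_values - y_values.mean()) ** 2)   # total sum of squares
R2 = 1 - SSE / SST
print(slope, intercept, R2)
```

The slope and intercept should reproduce the fitted equation quoted in the question, which is a quick sanity check before trusting R2.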

How to use df.rolling(window, min_periods, win_type='exponential').sum()

I would like to calculate the rolling exponentially weighted mean with df.rolling().mean(). I get stuck at win_type='exponential'.
I have tried other win_types such as 'gaussian'. I think 'exponential' must work a little differently.
dfTemp.rolling(window=21, min_periods=10, win_type='gaussian').mean(std=1)
# works fine
but when it comes to 'exponential',
dfTemp.rolling(window=21, min_periods=10, win_type='exponential').mean(tau=10)
# ValueError: The 'exponential' window needs one or more parameters -- pass a tuple.
How to use win_type='exponential'... Thanks~~~
I faced the same issue and asked about it on Russian SO:
Got the following answer:
x.rolling(window=(2,10), min_periods=1, win_type='exponential').mean(std=0.1)
You should pass tau value to window=(2, 10) parameter directly where 10 is a value for tau.
I hope it will help! Thanks to @MaxU
You can easily implement any kind of window by defining your own kernel function.
Here's an example for a backward-looking exponential average:
import pandas as pd
import numpy as np
# Kernel function (backward-looking exponential)
def K(x):
    return np.exp(-np.abs(x)) * np.where(x <= 0, 1, 0)
# Exponential average function
def exp_average(values):
    N = len(values)
    exp_weights = list(map(K, np.arange(-N, 0) / N))
    return values.dot(exp_weights) / N
# Create a sample DataFrame
# (pd.datetime is deprecated and gone from recent pandas; pd.Timestamp is used instead)
df = pd.DataFrame({
    'date': [pd.Timestamp(2020, 1, 1)]*50 + [pd.Timestamp(2020, 1, 2)]*50,
    'x': np.random.randn(100)
})
# Finally, compute the exponential moving average using `rolling` and `apply`
df['mu'] = df.groupby(['date'])['x'].rolling(5).apply(exp_average, raw=True).values
df.head(10)
Notice that, if N is fixed, you can significantly reduce the execution time by keeping the weights constant:
N = 10
exp_weights = list(map(K, np.arange(-N,0) / N ))
def exp_average(values):
    return values.dot(exp_weights) / N
Short answer: you should pass tau to the applied function, e.g., rolling(d, win_type='exponential').sum(tau=10). Note that the mean function does not respect the exponential window as expected, so you may need to use sum(tau=10)/window_size to calculate the exponential mean. This is a bug in the current version of pandas (1.0.5).
Full example:
# To calculate the rolling exponential mean
import numpy as np
import pandas as pd
window_size = 10
tau = 5
a = pd.Series(np.random.rand(100))
rolling_mean_a = a.rolling(window_size, win_type='exponential').sum(tau=tau) / window_size
The answer of @Илья Митусов is not correct. With pandas 1.0.5, running the following code raises ValueError: exponential window requires tau:
import pandas as pd
import numpy as np
pd.Series(np.arange(10)).rolling(window=(4, 10), min_periods=1, win_type='exponential').mean(std=0.1)
This code has several problems. First, the 10 in window=(4, 10) is not tau, and will lead to wrong answers. Second, the exponential window does not need the parameter std -- only the gaussian window needs it. Last, tau should be passed to mean (although mean does not respect the win_type).
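Because the behaviour of win_type='exponential' is reported above to depend on the pandas version, one way to check a result is to build the weights yourself and apply them with rolling().apply(). This is a minimal sketch, assuming scipy.signal.windows.exponential as the weight generator and normalizing by the sum of the weights (a weighted mean) rather than by the window size. The variable names and the seed are illustrative.

```python
import numpy as np
import pandas as pd
from scipy.signal.windows import exponential

window_size = 10
tau = 5

# Symmetric exponential window: weights decay away from the window centre
weights = exponential(window_size, tau=tau)

a = pd.Series(np.random.default_rng(0).random(100))

# Weighted mean over each full window; incomplete windows give NaN
weighted_mean = a.rolling(window_size).apply(
    lambda v: np.dot(v, weights) / weights.sum(), raw=True
)
print(weighted_mean.head(12))
```

Dividing by weights.sum() instead of window_size keeps the result on the same scale as the data, so a constant series maps to the same constant.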

Categorical features correlation

I have some categorical features in my data along with continuous ones. Is it a good or an absolutely bad idea to one-hot encode the categorical features in order to find their correlation with the labels, along with the other continuous features?
There is a way to calculate the correlation coefficient without one-hot encoding the categorical variable. Cramér's V statistic is one method for measuring the association between categorical variables, and it can be calculated as follows. The following link is helpful: Using pandas, calculate Cramér's coefficient matrix. Continuous variables can first be binned into categories with pandas' cut.
import numpy as np
import pandas as pd
import scipy.stats as ss
import seaborn as sns
print('Pandas version:', pd.__version__)
# Pandas version: 1.3.0
tips = sns.load_dataset("tips")
tips["total_bill_cut"] = pd.cut(tips["total_bill"],
                                np.arange(0, 55, 5),
                                include_lowest=True,
                                right=False)
def cramers_v(confusion_matrix):
    """Calculate Cramér's V statistic for categorical-categorical association.
    Uses the bias correction from Bergsma, Wicher,
    Journal of the Korean Statistical Society 42 (2013): 323-328.
    """
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))
confusion_matrix = pd.crosstab(tips["day"], tips["time"])
cramers_v(confusion_matrix.values)
# Out[2]: 0.9386619340722221
confusion_matrix = pd.crosstab(tips["total_bill_cut"], tips["time"])
cramers_v(confusion_matrix.values)
# Out[3]: 0.1649870749498837
Please note that .as_matrix() has been deprecated in pandas since version 0.23.0. Use .values instead.
I found phik library quite useful in calculating correlation between categorical and interval features. This is also useful for binning numerical features. Try this once: phik documentation
I was looking to do the same thing in BigQuery.
For numeric features you can use the built-in CORR(x, y) function.
For categorical features, you can calculate it as:
cardinality(cat1 x cat2) / max(cardinality(cat1), cardinality(cat2)).
Which translates to the following SQL:
SELECT
COUNT(DISTINCT(CONCAT(cat1, cat2))) / GREATEST (COUNT(DISTINCT(cat1)), COUNT(DISTINCT(cat2))) as cat1_2,
COUNT(DISTINCT(CONCAT(cat1, cat3))) / GREATEST (COUNT(DISTINCT(cat1)), COUNT(DISTINCT(cat3))) as cat1_3,
....
FROM ...
Higher number means lower correlation.
I used the following Python script to generate the SQL:
import itertools
arr = range(1, 10)
# Assumes the columns are named cat1 ... cat9, as in the SQL above
template = ('COUNT(DISTINCT(CONCAT(cat{a}, cat{b}))) / '
            'GREATEST (COUNT(DISTINCT(cat{a})), COUNT(DISTINCT(cat{b}))) as cat{a}_{b}')
query = ',\n'.join(template.format(a=a, b=b)
                   for (a, b) in itertools.combinations(arr, 2))
query = 'SELECT \n ' + query + '\n FROM `...`;'
print(query)
It should be straightforward to do the same thing in numpy.
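A rough pandas version of the same cardinality ratio might look like the sketch below. The helper name cardinality_ratio and the toy DataFrame are hypothetical, and missing values are not handled.

```python
import pandas as pd

def cardinality_ratio(df, a, b):
    """Number of distinct (a, b) pairs divided by the larger of the two
    individual cardinalities. 1.0 means one column determines the other;
    larger values mean a weaker association."""
    joint = len(df[[a, b]].drop_duplicates())
    return joint / max(df[a].nunique(), df[b].nunique())

# Toy data (hypothetical): cat2 is a relabelling of cat1, cat3 is unrelated
toy = pd.DataFrame({
    "cat1": ["x", "x", "y", "y"],
    "cat2": ["p", "p", "q", "q"],
    "cat3": ["p", "q", "p", "q"],
})

ratio_12 = cardinality_ratio(toy, "cat1", "cat2")
ratio_13 = cardinality_ratio(toy, "cat1", "cat3")
print(ratio_12, ratio_13)
```

Unlike the SQL version, this counts distinct pairs rather than distinct concatenated strings, so values such as ("ab", "c") and ("a", "bc") are not merged.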