Categorical features correlation - pandas

I have some categorical features in my data along with continuous ones. Is it a good idea, or an outright bad one, to one-hot encode the categorical features in order to find their correlation with the labels, alongside the other continuous features?

There is a way to calculate a correlation coefficient without one-hot encoding the categorical variable. Cramér's V is one statistic for measuring the association between two categorical variables, and it can be calculated as follows. The following link is helpful: Using pandas, calculate Cramér's coefficient matrix. To include variables with continuous values, you can first bin them into categories with pandas' cut.
import numpy as np
import pandas as pd
import scipy.stats as ss
import seaborn as sns
print('Pandas version:', pd.__version__)
# Pandas version: 1.3.0
tips = sns.load_dataset("tips")
# Bin the continuous total_bill column into 5-unit intervals: [0, 5), [5, 10), ...
tips["total_bill_cut"] = pd.cut(tips["total_bill"],
                                np.arange(0, 55, 5),
                                include_lowest=True,
                                right=False)
def cramers_v(confusion_matrix):
    """Calculate Cramér's V statistic for categorical-categorical association.
    Uses the bias correction from Bergsma, Wicher (2013),
    Journal of the Korean Statistical Society 42: 323-328.
    """
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    # Bias-corrected phi squared and bias-corrected table dimensions
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))
confusion_matrix = pd.crosstab(tips["day"], tips["time"])
cramers_v(confusion_matrix.values)
# Out[2]: 0.9386619340722221
confusion_matrix = pd.crosstab(tips["total_bill_cut"], tips["time"])
cramers_v(confusion_matrix.values)
# Out[3]: 0.1649870749498837
Please note that .as_matrix() has been deprecated in pandas since version 0.23.0; use .values instead.
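The cramers_v function above handles one pair of columns at a time. A minimal sketch of how it might be extended to a full pairwise matrix follows. The helper name cramers_v_matrix and the demo columns (size, size_code, colour) are made up for illustration, and synthetic data is used instead of the tips dataset so that the block runs without a download.

```python
import itertools

import numpy as np
import pandas as pd
import scipy.stats as ss


def cramers_v(table):
    """Bias-corrected Cramér's V for a 2-D contingency table (NumPy array)."""
    chi2 = ss.chi2_contingency(table)[0]
    n = table.sum()
    phi2 = chi2 / n
    r, k = table.shape
    phi2corr = max(0.0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    return float(np.sqrt(phi2corr / min(kcorr - 1, rcorr - 1)))


def cramers_v_matrix(df, columns):
    """Hypothetical helper: symmetric matrix of pairwise Cramér's V values."""
    result = pd.DataFrame(np.eye(len(columns)), index=columns, columns=columns)
    for a, b in itertools.combinations(columns, 2):
        v = cramers_v(pd.crosstab(df[a], df[b]).to_numpy())
        result.loc[a, b] = v
        result.loc[b, a] = v
    return result


# Synthetic demo data (an assumption, not taken from the thread)
rng = np.random.default_rng(0)
n = 300
size = rng.choice(["S", "M", "L"], size=n)
demo = pd.DataFrame({
    "size": size,
    # Same information as "size" under different labels: perfect association
    "size_code": pd.Series(size).map({"S": "s1", "M": "s2", "L": "s3"}).to_numpy(),
    # Drawn independently of "size": association should be near zero
    "colour": rng.choice(["red", "green", "blue"], size=n),
})

matrix = cramers_v_matrix(demo, ["size", "size_code", "colour"])
print(matrix.round(3))
```

The diagonal is set to 1 by construction. Each off-diagonal entry is computed once and mirrored, since Cramér's V is symmetric in its two variables.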

I found the phik library quite useful for calculating correlations between categorical and interval features. It is also useful for binning numerical features. Give it a try: phik documentation

I was looking to do the same thing in BigQuery.
For numeric features you can use the built-in CORR(x, y) function.
For categorical features, you can calculate it as:
cardinality(cat1 x cat2) / max(cardinality(cat1), cardinality(cat2))
which translates to the following SQL:
SELECT
  COUNT(DISTINCT(CONCAT(cat1, cat2))) / GREATEST (COUNT(DISTINCT(cat1)), COUNT(DISTINCT(cat2))) as cat1_2,
  COUNT(DISTINCT(CONCAT(cat1, cat3))) / GREATEST (COUNT(DISTINCT(cat1)), COUNT(DISTINCT(cat3))) as cat1_3,
  ....
FROM ...
A higher number means lower correlation.
I used the following Python script to generate the SQL:
import itertools
arr = range(1, 10)
# One ratio per unordered pair of columns cat1 ... cat9
query = ',\n'.join(
    'COUNT(DISTINCT(CONCAT(cat{a}, cat{b}))) / GREATEST (COUNT(DISTINCT(cat{a})), COUNT(DISTINCT(cat{b}))) as cat{a}_{b}'.format(a=a, b=b)
    for (a, b) in itertools.combinations(arr, 2))
query = 'SELECT \n ' + query + '\n FROM `...`;'
print(query)
It should be straightforward to do the same thing in numpy.
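As a rough illustration of the same ratio outside BigQuery, here is a minimal pandas sketch. The function name cardinality_ratio and the demo columns are assumptions made for this example. It counts distinct pairs with drop_duplicates rather than string concatenation, which sidesteps collisions such as 'ab' + 'c' versus 'a' + 'bc'.

```python
import pandas as pd


def cardinality_ratio(df, a, b):
    """Distinct (a, b) pairs divided by the larger of the two column cardinalities."""
    joint = len(df[[a, b]].drop_duplicates())
    return joint / max(df[a].nunique(), df[b].nunique())


# Small made-up table: country is fully determined by city, plan is not
demo = pd.DataFrame({
    "city":    ["Paris", "Lyon", "Rome", "Milan", "Paris", "Rome"],
    "country": ["FR", "FR", "IT", "IT", "FR", "IT"],
    "plan":    ["free", "paid", "free", "paid", "paid", "paid"],
})

dependent = cardinality_ratio(demo, "city", "country")
mixed = cardinality_ratio(demo, "city", "plan")
print(dependent)  # 1.0
print(mixed)      # 1.5
```

A ratio of 1.0 means one column determines the other. Larger values mean the columns combine more freely, which matches the remark that a higher number means lower correlation.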

Related

numpy.corrcoef() MemoryError

I can't understand the MemoryError I get when using numpy.corrcoef() to find the correlation coefficient between two vectors, smin and smax, as follows:
import numpy as np
from numpy import random as rn
r=0.01
sigma=0.2
T=1
K=1
N=252
h=T/N
M = 50000
Z = rn.randn(M,N)
S=np.ones((M,N+1))
smax=np.ones((M,1))
smin=np.ones((M,1))
for i in range(0,N):
    S[:,i+1]=S[:,i]*(np.exp((r-(sigma**2)/2)*h+sigma*Z[:,i]*np.sqrt(h)))
for j in range(0,M):
    smax[j,:]=np.exp(-r*T)*(np.max(S[j,:])>K)*(np.max(S[j,:])-K)
    smin[j,:]=np.exp(-r*T)*(np.min(S[j,:])<K)*(K-np.min(S[j,:]))
c=np.corrcoef(smax,smin)
print(c)
If there is another way to find the correlation coefficient, such as using pandas, that would be fine too.
The shape of your arrays is the problem here. The function documentation states that x is a "1-D or 2-D array containing multiple variables and observations. Each row of x represents a variable, and each column a single observation of all those variables." and that y is an additional set of variables and observations. Since smax and smin each have shape (50000, 1), every row is treated as a separate variable with a single observation, so this is trying to allocate an array of size (100000, 100000), which is huge.
If you just want to calculate the Pearson correlation coefficient between two one-dimensional vectors, you can use a much simpler formula than what is implemented here. This documentation has the formula I am referring to:
https://hydroerr.readthedocs.io/en/stable/api/HydroErr.HydroErr.pearson_r.html#HydroErr.HydroErr.pearson_r
To keep using the numpy version, pass both arrays together in the single parameter x, and make sure each of them is a 1-D array.
import numpy as np
simulated_array = np.random.rand(50000)
observed_array = np.random.rand(50000)
c = np.corrcoef([simulated_array, observed_array])[1, 0]
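Applied to column-shaped arrays like the question's smax and smin, the fix might look like the sketch below. Random data stands in for the simulated payoffs, and M is reduced only to keep the demo quick. Both are assumptions for illustration. The pandas line is included because the question asked about it.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
M = 1000  # smaller than the question's 50000, only to keep the demo quick
smax = rng.random((M, 1))  # column vectors, shaped like the question's arrays
smin = rng.random((M, 1))

# Option 1: flatten to 1-D so that each array is treated as a single variable
c_flat = np.corrcoef(smax.ravel(), smin.ravel())

# Option 2: keep the (M, 1) shape but declare that columns are the variables
c_cols = np.corrcoef(smax, smin, rowvar=False)

# Option 3: pandas, which returns the coefficient directly
r_pandas = pd.Series(smax.ravel()).corr(pd.Series(smin.ravel()))

print(c_flat.shape)  # (2, 2)
print(c_flat[0, 1], c_cols[0, 1], r_pandas)
```

All three give the same coefficient. The off-diagonal entry of the 2 x 2 matrix is the correlation between the two vectors.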

How to use df.rolling(window, min_periods, win_type='exponential').sum()

I would like to calculate the rolling exponentially weighted mean with df.rolling().mean(), but I am stuck at win_type='exponential'.
I have tried other win_types such as 'gaussian'. I suspect 'exponential' works a little differently.
dfTemp.rolling(window=21, min_periods=10, win_type='gaussian').mean(std=1)
# works fine
but when it comes to 'exponential',
dfTemp.rolling(window=21, min_periods=10, win_type='exponential').mean(tau=10)
# ValueError: The 'exponential' window needs one or more parameters -- pass a tuple.
How do I use win_type='exponential'? Thanks!
I faced the same issue and asked about it on the Russian SO.
I got the following answer:
x.rolling(window=(2,10), min_periods=1, win_type='exponential').mean(std=0.1)
You should pass the tau value directly in the window=(2, 10) parameter, where 10 is the value for tau.
I hope it helps! Thanks to @MaxU
You can easily implement any kind of window by defining your own kernel function.
Here's an example for a backward-looking exponential average:
import pandas as pd
import numpy as np
# Kernel function (backward-looking exponential)
def K(x):
    return np.exp(-np.abs(x)) * np.where(x <= 0, 1, 0)
# Exponential average function
def exp_average(values):
    N = len(values)
    exp_weights = list(map(K, np.arange(-N, 0) / N))
    return values.dot(exp_weights) / N
# Create a sample DataFrame
# (pd.Timestamp is used because pd.datetime is no longer available in newer pandas)
df = pd.DataFrame({
    'date': [pd.Timestamp(2020, 1, 1)]*50 + [pd.Timestamp(2020, 1, 2)]*50,
    'x': np.random.randn(100)
})
# Finally, compute the exponential moving average using `rolling` and `apply`
df['mu'] = df.groupby(['date'])['x'].rolling(5).apply(exp_average, raw=True).values
df.head(10)
Notice that, if N is fixed, you can significantly reduce the execution time by keeping the weights constant:
N = 10  # must equal the window length passed to rolling()
exp_weights = list(map(K, np.arange(-N, 0) / N))
def exp_average(values):
    return values.dot(exp_weights) / N
Short answer: pass tau to the applied function, e.g., rolling(d, win_type='exponential').sum(tau=10). Note that the mean function does not respect the exponential window as expected, so you may need to use sum(tau=10)/window_size to calculate the exponential mean. This is a bug in pandas 1.0.5, the current version at the time of writing.
Full example:
# To calculate the rolling exponential mean
import numpy as np
import pandas as pd
window_size = 10
tau = 5
a = pd.Series(np.random.rand(100))
rolling_mean_a = a.rolling(window_size, win_type='exponential').sum(tau=tau) / window_size
The answer by @Илья Митусов is not correct. With pandas 1.0.5, running the following code raises ValueError: exponential window requires tau:
import pandas as pd
import numpy as np
pd.Series(np.arange(10)).rolling(window=(4, 10), min_periods=1, win_type='exponential').mean(std=0.1)
This code has several problems. First, the 10 in window=(4, 10) is not tau and will lead to wrong answers. Second, the exponential window does not need the std parameter; only the gaussian window does. Finally, tau should be passed to mean (although mean does not respect the win_type).
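One way to check what the exponential window actually computes is to rebuild the weights by hand. The sketch below assumes that pandas obtains its weights from scipy.signal.windows.exponential with only tau supplied, which is my understanding of recent releases and may not hold for every version. Under that assumption the window is symmetric around its centre rather than decaying backwards in time.

```python
import numpy as np
import pandas as pd
from scipy.signal.windows import exponential

window_size = 10
tau = 5.0
a = pd.Series(np.random.default_rng(0).random(100))

# Weighted rolling sum as computed by pandas
rolled = a.rolling(window_size, win_type="exponential").sum(tau=tau)

# The same weights built directly with SciPy (assumed to match pandas)
weights = exponential(window_size, tau=tau)

# Manual weighted sum over the last full window
manual_last = float(np.dot(a.to_numpy()[-window_size:], weights))

# Dividing by the sum of the weights gives a normalised weighted mean,
# which is not the same as dividing by window_size
manual_mean_last = manual_last / weights.sum()

print(rolled.iloc[-1], manual_last)
print(manual_mean_last)
```

If a backward-looking, decaying weighting is what you are after, a custom kernel with rolling().apply() as in the earlier answer may be closer to the intent than this symmetric window.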

How to input data for the Shapiro-Wilk test using Python scipy

I am trying to run a normality test on my data.
# Method 1
import numpy as np
from scipy.stats import shapiro
data = [1874181.6503, 2428393.05517, 2486600.8183,...] # there are 146 data points
data = np.array(data)
stat, p = shapiro(data)
print('statistics=%.3f, p=%.3f' % (stat, p))
alpha = 0.05
if p > alpha:
    print('its Gaussian ')
else:
    print('not Gaussian')
----Output----
statistics=0.582, p=0.000
not Gaussian
When I run it, the result is "not Gaussian". But when I calculate the mean and standard deviation and generate a sample using np.random.normal(mu, sigma, 146) (shown below), the result is "Gaussian".
# Method 2
import numpy as np
from scipy.stats import shapiro
data = [1874181.6503, 2428393.05517, 2486600.8183,...] # there are 146 data points
data = np.array(data)
d_mu = np.mean(data)
d_sig = np.std(data)
data = np.random.normal(d_mu,d_sig, 146)
stat, p = shapiro(data)
print('statistics=%.3f, p=%.3f' % (stat, p))
alpha = 0.05
if p > alpha:
    print('its Gaussian ')
else:
    print('not Gaussian')
------ Output ----
statistics=0.987, p=0.212
its Gaussian
I am very new to data analytics. It would be helpful if someone could help me with the questions below:
1. Which is the right way to run the Shapiro test: Method 1 or Method 2?
2. I have difficulty understanding the np.random.normal(d_mu, d_sig, 146) function. The definition given in the docs is "Draw random samples from a normal (Gaussian) distribution." But what data sample is it generating? We already have data (my input data), and we have calculated its mean and standard deviation to plot the normal distribution, yet the function returns some other data sample, and my Shapiro test passes for that. (I know I am misunderstanding something, but I am not able to decide which one is right.)
3. I am trying to fit a normal distribution to time series data. Can anyone suggest helpful docs or links on normality testing and the normal distribution? Anything that guides me in the right direction would help.
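A small sketch may help with the first two questions. np.random.normal does not transform the input data: it draws brand-new values that share only the requested mean and standard deviation. Testing those draws therefore says nothing about the original measurements. The skewed observed array below is a hypothetical stand-in, since the real 146 points are not shown in the question.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(42)

# Hypothetical stand-in for the real measurements: strongly right-skewed values
observed = rng.lognormal(mean=14.0, sigma=1.0, size=146)

# Method 1: test the observed values themselves
stat_obs, p_obs = shapiro(observed)

# Method 2: draw new normal values that only share the mean and std
synthetic = rng.normal(observed.mean(), observed.std(), size=146)
stat_syn, p_syn = shapiro(synthetic)

print('observed:  statistics=%.3f, p=%.3f' % (stat_obs, p_obs))
print('synthetic: statistics=%.3f, p=%.3f' % (stat_syn, p_syn))
```

Under these assumptions, only the Method 1 style of call tests the data you actually collected. The Method 2 call tests whether NumPy's normal generator produces normal-looking samples.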

How is pandas kurtosis defined?

I am trying to get the kurtosis using pandas. After some exploration, I have
import numpy as np
import pandas as pd
test_series = pd.Series(np.random.randn(5000))
test_series.kurtosis()
However, the output is:
-0.006755982906479385
But I think the kurtosis (https://en.wikipedia.org/wiki/Kurtosis) should be close to the following (perhaps normalized over N-1 instead of N, but that does not matter here):
(test_series - test_series.mean()).pow(4).mean()/np.power(test_series.std(),4)
which is
2.9908543104146026
The pandas documentation says the following
Return unbiased kurtosis over requested axis using Fisher’s definition of kurtosis (kurtosis of normal == 0.0)
This is probably the excess kurtosis, defined as kurtosis - 3.
Pandas calculates the unbiased estimator of the excess kurtosis. Kurtosis is the normalized 4th central moment. To find unbiased estimators of the cumulants you need the k-statistics.
So the estimator of the excess kurtosis is k4/k2**2.
To illustrate this:
import pandas as pd
import numpy as np
np.random.seed(11234)
test_series = pd.Series(np.random.randn(5000))
test_series.kurtosis()
#-0.0411811269445872
Now we can calculate this explicitly using the k-statistics:
n = len(test_series)
S1 = test_series.pow(1).sum()
S2 = test_series.pow(2).sum()
S3 = test_series.pow(3).sum()
S4 = test_series.pow(4).sum()
# Eq (7) and (5) from the k-statistics link
k4 = (-6*S1**4 + 12*n*S1**2*S2 - 3*n*(n-1)*S2**2 -4*n*(n+1)*S1*S3 + n**2*(n+1)*S4)/(n*(n-1)*(n-2)*(n-3))
k2 = (n*S2-S1**2)/(n*(n-1))
# k2 is the same as the N-1 variance: test_series.std(ddof=1)**2
k4/k2**2
#-0.04118112694458816
If you want agreement to more decimal places, you'll need to be careful with the sums, as they get rather large. But the two results are identical to 12 places.
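As a cross-check that avoids the large power sums, SciPy exposes the same two conventions through scipy.stats.kurtosis. This is a sketch, reusing the seed from the example above. As far as I know, bias=False with fisher=True corresponds to the k-statistics estimator that pandas reports, while bias=True gives the plain moment ratio minus 3.

```python
import numpy as np
import pandas as pd
from scipy import stats

np.random.seed(11234)
test_series = pd.Series(np.random.randn(5000))
values = test_series.to_numpy()

# pandas: bias-corrected excess kurtosis
pandas_kurt = test_series.kurtosis()

# SciPy with the bias correction switched on, then off
scipy_unbiased = stats.kurtosis(values, fisher=True, bias=False)
scipy_biased = stats.kurtosis(values, fisher=True, bias=True)

# Plain moment ratio m4 / m2**2, with both moments normalised by N
centered = test_series - test_series.mean()
raw_ratio = centered.pow(4).mean() / centered.pow(2).mean() ** 2

print(pandas_kurt, scipy_unbiased)
print(scipy_biased, raw_ratio - 3)
```

The first printed pair should agree with each other, and so should the second pair. The gap between the two pairs is the bias correction, which is small for 5000 points.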

pandas: finding the root of a function

I have some data frame in pandas, where the columns can be viewed as smooth functions of the index:
x      f        g
------------------------
0.1    f(0.1)   g(0.1)
0.2    f(0.2)   g(0.2)
...
And I want to know the x value for which f(x) = y, where y is given and I don't necessarily have a data point at the x that I am looking for.
Essentially I want to find the intersection of a line and a data series in pandas. Is there a best way to do this?
Suppose your DataFrame looks something like this:
import numpy as np
import pandas as pd
def unknown_func(x):
    return -x ** 3 + 1
x = np.linspace(-10, 10, 100)
df = pd.DataFrame({'f': unknown_func(x)}, index=x)
then, using scipy, you could create an interpolation function:
import scipy.interpolate as interpolate
func = interpolate.interp1d(x, df['f'], kind='linear')
and then use a root finder to solve f(x)-y=0 for x:
import scipy.optimize as optimize
root = optimize.brentq(lambda x: func(x)-y, x.min(), x.max())
Putting it all together:
import numpy as np
import pandas as pd
import scipy.optimize as optimize
import scipy.interpolate as interpolate
def unknown_func(x):
    return -x ** 3 + 1
x = np.linspace(-10, 10, 100)
df = pd.DataFrame({'f': unknown_func(x)}, index=x)
y = 50
func = interpolate.interp1d(x, df['f'], kind='linear')
root = optimize.brentq(lambda x: func(x)-y, x.min(), x.max())
print(root)
# -3.6566397064
print(func(root))
# 50.0
idx = np.searchsorted(df.index.values, root)
print(df.iloc[idx-1:idx+1])
# f
# -3.737374 53.203496
# -3.535354 45.187410
Notice that you need some model for your data. Above, the linear interpolator
interp1d implicitly imposes a model for the unknown function that
generated the data.
If you already have a model function (such as unknown_func), then you could use that instead of the func returned by interp1d. If
you have a parametrized model function, then instead of interp1d you could use
optimize.curve_fit to find the best-fitting parameters. And if you do choose
to interpolate, there are many other kinds of interpolation (e.g. quadratic or
cubic) which you might use too. What to choose depends on what you think best models your data.
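For the parametrized-model route, a minimal sketch might look like this. The cubic form a * x**3 + b is an assumed model chosen to match the example data, not something known about real data. curve_fit estimates a and b, and brentq then solves the fitted model for the target y.

```python
import numpy as np
import pandas as pd
import scipy.optimize as optimize


def model(x, a, b):
    # Hypothetical parametrized model: a cubic with unknown scale and offset
    return a * x ** 3 + b


x = np.linspace(-10, 10, 100)
df = pd.DataFrame({'f': -x ** 3 + 1}, index=x)
y = 50

# Fit the model parameters to the tabulated data
params, _ = optimize.curve_fit(model, df.index.to_numpy(), df['f'].to_numpy())

# Solve model(x) - y = 0 on the range covered by the data
root = optimize.brentq(lambda t: model(t, *params) - y, x.min(), x.max())
print(params)
print(root)
```

Because the fitted model here matches the generating function, the root lands close to -3.659, slightly different from the -3.6566 obtained with linear interpolation. With real data, the quality of the root depends on how well the chosen model describes it.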