How to input data for the Shapiro-Wilk test using Python SciPy / NumPy

I am trying to do a normality test on my data.
# Method 1
import numpy as np
from scipy.stats import shapiro
data = [1874181.6503, 2428393.05517, 2486600.8183,...] # there are 146 data points
data = np.array(data)
stat, p = shapiro(data)
print('statistics=%.3f, p=%.3f' % (stat, p))
alpha = 0.05
if p > alpha:
    print('its Gaussian ')
else:
    print('not Gaussian')
----Output----
statistics=0.582, p=0.000
not Gaussian
When I run it, the test says the data is not Gaussian. But when I calculate the mean and standard deviation and generate a sample using np.random.normal(mu, sigma, 146) (shown below as Method 2), that sample is reported as Gaussian.
# Method 2
import numpy as np
from scipy.stats import shapiro
data = [1874181.6503, 2428393.05517, 2486600.8183,...] # there are 146 data points
data = np.array(data)
d_mu = np.mean(data)
d_sig = np.std(data)
data = np.random.normal(d_mu,d_sig, 146)
stat, p = shapiro(data)
print('statistics=%.3f, p=%.3f' % (stat, p))
alpha = 0.05
if p > alpha:
    print('its Gaussian ')
else:
    print('not Gaussian')
------ Output ----
statistics=0.987, p=0.212
its Gaussian
I am very new to data analytics, so it would be helpful if someone could clear up the doubts below.
Which is the right way to do the Shapiro-Wilk test: Method 1 or Method 2?
I have difficulty understanding the np.random.normal(d_mu, d_sig, 146) call. The definition given in the docs is "Draw random samples from a normal (Gaussian) distribution." But what sample is it generating? We already have the data (my input data), and we calculated its mean and standard deviation to plot the normal distribution, yet the function returns some other sample and my Shapiro test passes for that one. (I know I am probably taking this wrongly, but I cannot decide which is right.)
I am trying to fit a normal distribution to time-series data. Can anyone suggest docs or helpful links on normality testing and fitting a normal distribution? Anything that points me in the right direction.
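For what it's worth, here is a minimal sketch (the numbers are arbitrary and not related to the dataset above) of what np.random.normal actually does: it draws brand-new synthetic values from an ideal Gaussian with the given mean and standard deviation, so running shapiro on that sample only tests the synthetic sample, not the measured data.
import numpy as np
from scipy.stats import shapiro

synthetic = np.random.normal(0, 1, 146)    # 146 values drawn from a perfect Gaussian
print(shapiro(synthetic))                  # will usually give p > 0.05 ("Gaussian")

skewed = np.random.exponential(1.0, 146)   # 146 values from a clearly non-Gaussian source
print(shapiro(skewed))                     # will usually give p < 0.05 ("not Gaussian")
# So Method 1 (shapiro on the measured data itself) tests the actual data;
# Method 2 only tests the generated replacement sample.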

Related

How to use df.rolling(window, min_periods, win_type='exponential').sum()

I would like to calculate the rolling exponentially weighted mean with df.rolling().mean(), but I get stuck at win_type='exponential'.
I have tried other win_types such as 'gaussian', which I think should work only a little differently from 'exponential'.
dfTemp.rolling(window=21, min_periods=10, win_type='gaussian').mean(std=1)
# works fine
but when it comes to 'exponential',
dfTemp.rolling(window=21, min_periods=10, win_type='exponential').mean(tau=10)
# ValueError: The 'exponential' window needs one or more parameters -- pass a tuple.
How do I use win_type='exponential'? Thanks!
I faced the same issue and asked about it on the Russian SO. I got the following answer:
x.rolling(window=(2,10), min_periods=1, win_type='exponential').mean(std=0.1)
You should pass the tau value directly in the window=(2, 10) parameter, where 10 is the value for tau.
I hope it will help! Thanks to @MaxU
You can easily implement any kind of window by defining your kernel function.
Here's an example for a backward-looking exponential average:
import pandas as pd
import numpy as np

# Kernel function (backward-looking exponential)
def K(x):
    return np.exp(-np.abs(x)) * np.where(x <= 0, 1, 0)

# Exponential average function
def exp_average(values):
    N = len(values)
    exp_weights = list(map(K, np.arange(-N, 0) / N))
    return values.dot(exp_weights) / N

# Create a sample DataFrame
df = pd.DataFrame({
    'date': [pd.Timestamp(2020, 1, 1)] * 50 + [pd.Timestamp(2020, 1, 2)] * 50,
    'x': np.random.randn(100)
})

# Finally, compute the exponential moving average using `rolling` and `apply`
df['mu'] = df.groupby(['date'])['x'].rolling(5).apply(exp_average, raw=True).values
df.head(10)
Notice that, if N is fixed, you can significantly reduce the execution time by keeping the weights constant:
N = 10
exp_weights = list(map(K, np.arange(-N, 0) / N))

def exp_average(values):
    return values.dot(exp_weights) / N
Short answer: you should pass tau to the applied function, e.g. rolling(d, win_type='exponential').sum(tau=10). Note that the mean function does not respect the exponential window as expected, so you may need to use sum(tau=10)/window_size to calculate the exponential mean. This is a bug in the current version of pandas (1.0.5).
Full example:
# To calculate the rolling exponential mean
import numpy as np
import pandas as pd
window_size = 10
tau = 5
a = pd.Series(np.random.rand(100))
rolling_mean_a = a.rolling(window_size, win_type='exponential').sum(tau=tau) / window_size
The answer by Илья Митусов is not correct. With pandas 1.0.5, running the following code raises ValueError: exponential window requires tau:
import pandas as pd
import numpy as np
pd.Series(np.arange(10)).rolling(window=(4, 10), min_periods=1, win_type='exponential').mean(std=0.1)
This code has several problems. First, the 10 in window=(4, 10) is not tau and will lead to wrong answers. Second, the exponential window does not need the parameter std; only the gaussian window does. Last, tau should be provided to mean (although mean does not respect the win_type).
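As a rough illustration (a sketch, not a reproduction of pandas internals): pandas' win_type windows come from scipy.signal, so you can look at scipy's exponential window directly to see what weights a given tau produces.
import numpy as np
from scipy.signal import windows

window_size = 10
tau = 5
w = windows.exponential(window_size, tau=tau)   # exponential window weights
print(np.round(w, 3))
# The weighted rolling sum is then conceptually sum(weights * values) over each window.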

How is pandas kurtosis defined?

I am trying to get kurtosis using pandas. By doing some exploration, I have
test_series = pd.Series(np.random.randn(5000))
test_series.kurtosis()
however, the output is:
-0.006755982906479385
But I think the kurtosis (https://en.wikipedia.org/wiki/Kurtosis) should be close to the value of the following (maybe normalized over N-1 instead of N, but that does not matter much here):
(test_series - test_series.mean()).pow(4).mean()/np.power(test_series.std(),4)
which is
2.9908543104146026
The pandas documentation says the following:
Return unbiased kurtosis over requested axis using Fisher’s definition of kurtosis (kurtosis of normal == 0.0)
This is probably the excess kurtosis, defined as kurtosis - 3.
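A quick sketch of that claim (using the same kind of random sample): subtracting 3 from the raw, biased estimate already lands close to the pandas value; the small remaining gap comes from the bias correction explained in the next answer.
import numpy as np
import pandas as pd

test_series = pd.Series(np.random.randn(5000))
m4 = (test_series - test_series.mean()).pow(4).mean()
raw_kurtosis = m4 / test_series.std(ddof=0) ** 4   # biased estimate, close to 3 for normal data
print(raw_kurtosis - 3)         # raw excess kurtosis
print(test_series.kurtosis())   # pandas' (bias-corrected) excess kurtosis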
Pandas is calculating the unbiased estimator of the excess kurtosis. Kurtosis is the normalized 4th central moment. To find the unbiased estimators of the cumulants you need the k-statistics.
So the unbiased estimator of the excess kurtosis is k4/k2**2.
To illustrate this:
import pandas as pd
import numpy as np
np.random.seed(11234)
test_series = pd.Series(np.random.randn(5000))
test_series.kurtosis()
#-0.0411811269445872
Now we can calculate this explicitly using the k-statistics:
n = len(test_series)
S1 = test_series.pow(1).sum()
S2 = test_series.pow(2).sum()
S3 = test_series.pow(3).sum()
S4 = test_series.pow(4).sum()
# Eq (7) and (5) from the k-statistics link
k4 = (-6*S1**4 + 12*n*S1**2*S2 - 3*n*(n-1)*S2**2 -4*n*(n+1)*S1*S3 + n**2*(n+1)*S4)/(n*(n-1)*(n-2)*(n-3))
k2 = (n*S2-S1**2)/(n*(n-1))
# k2 is the same as the N-1 variance: test_series.std(ddof=1)**2
k4/k2**2
#-0.04118112694458816
If you want better agreement to more decimal places, you'll need to be careful with the sums as they get rather large. But they're identical to 12 places.
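As a cross-check (a sketch, assuming scipy is available): scipy's bias-corrected excess kurtosis uses the same k-statistics correction, so it should agree with pandas to many decimal places.
import numpy as np
import pandas as pd
from scipy import stats

np.random.seed(11234)
test_series = pd.Series(np.random.randn(5000))
print(test_series.kurtosis())                                 # pandas' unbiased excess kurtosis
print(stats.kurtosis(test_series, fisher=True, bias=False))   # scipy's bias-corrected version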

Categorical features correlation

I have some categorical features in my data along with continuous ones. Is it a good or absolutely bad idea to one-hot encode the categorical features in order to find their correlation with the labels, along with the other continuous features?
There is a way to calculate a correlation coefficient without one-hot encoding the categorical variable. Cramér's V statistic is one method for measuring the association between categorical variables, and it can be calculated as follows. The following link is helpful: Using pandas, calculate Cramér's coefficient matrix. For variables with continuous values, you can categorize them by using pandas cut.
import numpy as np
import pandas as pd
import scipy.stats as ss
import seaborn as sns
print('Pandas version:', pd.__version__)
# Pandas version: 1.3.0
tips = sns.load_dataset("tips")
tips["total_bill_cut"] = pd.cut(tips["total_bill"],
np.arange(0, 55, 5),
include_lowest=True,
right=False)
def cramers_v(confusion_matrix):
""" calculate Cramers V statistic for categorial-categorial association.
uses correction from Bergsma and Wicher,
Journal of the Korean Statistical Society 42 (2013): 323-328
"""
chi2 = ss.chi2_contingency(confusion_matrix)[0]
n = confusion_matrix.sum()
phi2 = chi2 / n
r, k = confusion_matrix.shape
phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
rcorr = r - ((r-1)**2)/(n-1)
kcorr = k - ((k-1)**2)/(n-1)
return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))
confusion_matrix = pd.crosstab(tips["day"], tips["time"])
cramers_v(confusion_matrix.values)
# Out[2]: 0.9386619340722221
confusion_matrix = pd.crosstab(tips["total_bill_cut"], tips["time"])
cramers_v(confusion_matrix.values)
# Out[3]: 0.1649870749498837
Please note that .as_matrix() has been deprecated in pandas since version 0.23.0; use .values instead.
I found the phik library quite useful for calculating correlation between categorical and interval features. It is also useful for binning numerical features. Give it a try: phik documentation.
I was looking to do the same thing in BigQuery.
For numeric features you can use the built-in CORR(x, y) function.
For categorical features, you can calculate it as:
cardinality(cat1 x cat2) / max(cardinality(cat1), cardinality(cat2))
which translates to the following SQL:
SELECT
COUNT(DISTINCT(CONCAT(cat1, cat2))) / GREATEST (COUNT(DISTINCT(cat1)), COUNT(DISTINCT(cat2))) as cat1_2,
COUNT(DISTINCT(CONCAT(cat1, cat3))) / GREATEST (COUNT(DISTINCT(cat1)), COUNT(DISTINCT(cat3))) as cat1_3,
....
FROM ...
Higher number means lower correlation.
I used the following Python script to generate the SQL:
import itertools

arr = range(1, 10)
query = ',\n'.join('COUNT(DISTINCT(CONCAT({a}, {b}))) / GREATEST (COUNT(DISTINCT({a})), COUNT(DISTINCT({b}))) as cat{a}_{b}'.format(a=a, b=b)
                   for (a, b) in itertools.combinations(arr, 2))
query = 'SELECT \n ' + query + '\n FROM `...`;'
print(query)
It should be straightforward to do the same thing with numpy/pandas, for example:
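A minimal pandas sketch of the same cardinality ratio (col_a and col_b below are placeholders for your own categorical column names):
import pandas as pd

def cardinality_ratio(df, col_a, col_b):
    """COUNT(DISTINCT concat) / GREATEST(COUNT(DISTINCT a), COUNT(DISTINCT b))."""
    pair_card = (df[col_a].astype(str) + "_" + df[col_b].astype(str)).nunique()
    return pair_card / max(df[col_a].nunique(), df[col_b].nunique())

# e.g. with the seaborn tips dataset used earlier in this thread:
# tips = sns.load_dataset("tips"); cardinality_ratio(tips, "day", "time")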

Sampling from posterior using custom likelihood in pymc3

I'm trying to create a custom likelihood using pymc3. The distribution is the Generalized Extreme Value (GEV) distribution, which has location (loc), scale (scale) and shape (c) parameters.
The main idea is to choose a beta distribution as a prior for the shape parameter and to fix the location and scale parameters in the GEV likelihood.
The GEV distribution is not among the pymc3 standard distributions, so I have to create a custom likelihood. I googled it and found that I should use the DensityDist method, but I can't figure out why my attempt is incorrect.
See the code below:
import pymc3 as pm
import numpy as np
from theano.tensor import exp

data = np.random.randn(20)

with pm.Model() as model:
    c = pm.Beta('c', alpha=6, beta=9)
    loc = 1
    scale = 2
    gev = pm.DensityDist('gev', lambda value: exp(-1 + c * (((value - loc) / scale) ^ (1 / c))), testval=1)
    modelo = pm.gev(loc=loc, scale=scale, c=c, observed=data)
    step = pm.Metropolis()
    trace = pm.sample(1000, step)
pm.traceplot(trace)
I'm sorry in advance if this is a dumb question, but I couldn't figure it out.
I'm studying annual maximum flows and I'm trying to implement the methodology described in "Generalized maximum-likelihood generalized extreme-value quantile estimators for hydrologic data" by Martins and Stedinger.
If you mean the generalized extreme value distribution (https://en.wikipedia.org/wiki/Generalized_extreme_value_distribution), then something like this should work (for c != 0):
import pymc3 as pm
import numpy as np
import theano.tensor as tt
from pymc3.distributions.dist_math import bound

data = np.random.randn(20)

with pm.Model() as model:
    c = pm.Beta('c', alpha=6, beta=9)
    loc = 1
    scale = 2

    def gev_logp(value):
        # GEV log-density for c != 0:
        # -log(scale) - (1 + 1/c) * log1p(c*z) - (1 + c*z)**(-1/c), with z = (value - loc) / scale
        scaled = (value - loc) / scale
        logp = -(tt.log(scale)
                 + ((c + 1) / c) * tt.log1p(c * scaled)
                 + (1 + c * scaled) ** (-1/c))
        # support: value > loc - scale/c when c > 0, value < loc - scale/c when c < 0
        alpha = loc - scale / c
        bounds = tt.switch(c > 0, value > alpha, value < alpha)
        return bound(logp, bounds, c != 0)

    gev = pm.DensityDist('gev', gev_logp, observed=data)
    trace = pm.sample(2000, tune=1000, njobs=4)
pm.traceplot(trace)
Your logp function was invalid. Exponentiation is ** in python, and part of the expression wasn't valid for negative values.
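For a quick sanity check of the density above, you can compare it against scipy.stats.genextreme, keeping in mind that scipy's shape parameter uses the opposite sign convention from the c used here. A minimal sketch (the parameter values and evaluation points are arbitrary examples):
import numpy as np
from scipy.stats import genextreme

c, loc, scale = 0.4, 1.0, 2.0     # example fixed values; in the model c is random
x = np.array([0.5, 1.0, 2.5])     # points inside the support (x > loc - scale/c)
z = (x - loc) / scale
manual = -np.log(scale) - (1 + 1/c) * np.log1p(c * z) - (1 + c * z) ** (-1/c)
print(manual)
print(genextreme.logpdf(x, -c, loc=loc, scale=scale))   # scipy's shape has the opposite sign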

Fitting partial Gaussian

I'm trying to fit a sum of Gaussians using scikit-learn, because the scikit-learn GaussianMixture seems much more robust than curve_fit.
Problem: it doesn't do a great job of fitting a truncated part of even a single Gaussian peak:
from sklearn import mixture
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import numpy as np

clf = mixture.GaussianMixture(n_components=1, covariance_type='full')
data = np.random.randn(10000)
data = [[x] for x in data]
clf.fit(data)
data = [item for sublist in data for item in sublist]
rangeMin = int(np.floor(np.min(data)))
rangeMax = int(np.ceil(np.max(data)))
h = plt.hist(data, range=(rangeMin, rangeMax), normed=True);
plt.plot(np.linspace(rangeMin, rangeMax),
         mlab.normpdf(np.linspace(rangeMin, rangeMax),
                      clf.means_, np.sqrt(clf.covariances_[0]))[0])
gives
Now changing data = [[x] for x in data] to data = [[x] for x in data if x < 0] in order to truncate the distribution returns
Any ideas how to get the truncation fitted properly?
Note: the distribution isn't necessarily truncated in the middle; there could be anything between 50% and 100% of the full distribution left.
I would also be happy if anyone can point me to alternative packages. I've only tried curve_fit, but couldn't get it to do anything useful as soon as more than two peaks were involved.
A bit brutish, but a simple solution would be to split the curve in two halves (data = [[x] for x in data if x < 0]), mirror the left part (data.append([-data[d][0]])) and then do a regular Gaussian fit.
import numpy as np
from sklearn import mixture
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab

np.random.seed(seed=42)
n = 10000
clf = mixture.GaussianMixture(n_components=1, covariance_type='full')

# split the data and mirror it
data = np.random.randn(n)
data = [[x] for x in data if x < 0]
n = len(data)
for d in range(n):
    data.append([-data[d][0]])

clf.fit(data)
data = [item for sublist in data for item in sublist]
rangeMin = int(np.floor(np.min(data)))
rangeMax = int(np.ceil(np.max(data)))
h = plt.hist(data[0:n], bins=20, range=(rangeMin, rangeMax), normed=True);
plt.plot(np.linspace(rangeMin, rangeMax),
         mlab.normpdf(np.linspace(rangeMin, rangeMax),
                      clf.means_, np.sqrt(clf.covariances_[0]))[0] * 2)
plt.show()
@lhcgeneva, the problem is that once the data no longer includes the maximum of the curve, more and more different Gaussians can fit it equally well.
In the figure, the black points represent the data to fit and the red points the fitted result. This result was achieved by using "A Simple Algorithm for Fitting a Gaussian Function".
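The basic idea behind that algorithm is that the logarithm of a Gaussian is a parabola, so a least-squares parabola fit to log(y) recovers the parameters. A minimal sketch of that idea on noiseless data (the paper adds weighting and iteration to cope with noise; the parameter values below are arbitrary):
import numpy as np

# Sample (x, y) points on a Gaussian with mu=1.0, sigma=0.5, amplitude=2.0
x = np.linspace(-0.5, 2.5, 50)
y = 2.0 * np.exp(-(x - 1.0) ** 2 / (2 * 0.5 ** 2))

# Fit a parabola to log(y): log(y) = a*x^2 + b*x + c
a, b, c = np.polyfit(x, np.log(y), 2)
sigma = np.sqrt(-1 / (2 * a))
mu = -b / (2 * a)
amplitude = np.exp(c - b ** 2 / (4 * a))
print(mu, sigma, amplitude)   # should recover approximately 1.0, 0.5, 2.0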