Sampling from posterior using custom likelihood in pymc3 - bayesian

I'm trying to create a custom likelihood using pymc3. The distribution is the Generalized Extreme Value (GEV) distribution, which has location (loc), scale (scale) and shape (c) parameters.
The main idea is to choose a beta distribution as the prior for the shape parameter and to fix the location and scale parameters in the GEV likelihood.
The GEV distribution is not among the standard pymc3 distributions, so I have to create a custom likelihood. I googled it and found out that I should use the DensityDist class, but I don't know why my code is incorrect.
See the code below:
import pymc3 as pm
import numpy as np
from theano.tensor import exp
data = np.random.randn(20)
with pm.Model() as model:
    c = pm.Beta('c', alpha=6, beta=9)
    loc = 1
    scale = 2
    gev = pm.DensityDist('gev', lambda value: exp(-1+c*(((value-loc)/scale)^(1/c))), testval=1)
    modelo = pm.gev(loc=loc, scale=scale, c=c, observed=data)
    step = pm.Metropolis()
    trace = pm.sample(1000, step)
pm.traceplot(trace)
I'm sorry in advance if this is a dumb question, but I couldn't figure it out.
I'm studying annual maximum flows and I'm trying to implement the methodology described in "Generalized maximum-likelihood generalized extreme-value quantile estimators for hydrologic data" by Martins and Stedinger.

If you mean the generalized extreme value distribution (https://en.wikipedia.org/wiki/Generalized_extreme_value_distribution), then something like this should work (for c != 0):
import pymc3 as pm
import numpy as np
import theano.tensor as tt
from pymc3.distributions.dist_math import bound
data = np.random.randn(20)
with pm.Model() as model:
    c = pm.Beta('c', alpha=6, beta=9)
    loc = 1
    scale = 2
    def gev_logp(value):
        scaled = (value - loc) / scale
        # GEV log-density: -log(scale) - (1 + 1/c)*log(1 + c*z) - (1 + c*z)**(-1/c)
        logp = -(np.log(scale)
                 + ((c + 1) / c) * tt.log1p(c * scaled)
                 + (1 + c * scaled) ** (-1 / c))
        # Support: value > alpha when c > 0, value < alpha when c < 0
        alpha = loc - scale / c
        bounds = tt.switch(c > 0, value > alpha, value < alpha)
        return bound(logp, bounds, c != 0)
    gev = pm.DensityDist('gev', gev_logp, observed=data)
    trace = pm.sample(2000, tune=1000, njobs=4)
pm.traceplot(trace)
Your logp function was invalid. Exponentiation is ** in Python (^ is bitwise XOR), and part of the expression wasn't valid for negative values.
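As a sanity check on the formula, here is a minimal NumPy sketch of the GEV log-density for c != 0, compared against scipy.stats.genextreme. The helper name gev_logpdf and the parameter values are assumptions made for illustration and are not part of the original answer. Note that SciPy's shape parameter has the opposite sign to the c used here.

```python
import numpy as np
from scipy.stats import genextreme

def gev_logpdf(x, loc, scale, c):
    """Log-density of the GEV distribution for c != 0 (hypothetical helper)."""
    z = (np.asarray(x, dtype=float) - loc) / scale
    t = 1.0 + c * z
    inside = t > 0                      # points inside the support
    safe_t = np.where(inside, t, 1.0)   # avoid log of non-positive numbers
    logp = -(np.log(scale)
             + (1.0 + 1.0 / c) * np.log(safe_t)
             + safe_t ** (-1.0 / c))
    return np.where(inside, logp, -np.inf)

# Illustrative values; with c > 0 the support is x > loc - scale / c = -4
loc, scale, c = 1.0, 2.0, 0.4
x = np.array([-3.0, 0.0, 1.0, 5.0, -10.0])   # the last point lies outside the support

ours = gev_logpdf(x, loc, scale, c)
# SciPy's genextreme uses the opposite sign convention for the shape parameter
ref = genextreme.logpdf(x, c=-c, loc=loc, scale=scale)
print(ours)
print(ref)
```

If the two printed arrays agree, the same expression can be ported to theano tensors with more confidence.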

Related

Why do NumPy and PyTorch give different results after mean and variance normalization?

I am working on a problem in which a matrix has to be mean-var normalized row-wise. It is also required that the normalization is applied after splitting each row into tiny batches.
The code seems to work with NumPy but fails with PyTorch (which is required for training).
The PyTorch and NumPy results differ. Any help will be greatly appreciated.
Example code:
import numpy as np
import torch
def normalize(x, bsize, eps=1e-6):
    nc = x.shape[1]
    if nc % bsize != 0:
        raise Exception('Number of columns must be a multiple of bsize')
    x = x.reshape(-1, bsize)
    m = x.mean(1).reshape(-1, 1)
    s = x.std(1).reshape(-1, 1)
    n = (x - m) / (eps + s)
    n = n.reshape(-1, nc)
    return n
# numpy
a = np.float32(np.random.randn(8, 8))
n1 = normalize(a, 4)
# torch
b = torch.tensor(a)
n2 = normalize(b, 4)
n2 = n2.numpy()
print(abs(n1-n2).max())
In the first example you are calling normalize with a, a numpy.ndarray, while in the second you call normalize with b, a torch.Tensor.
According to the documentation page of torch.std, Bessel’s correction is used by default when computing the standard deviation, whereas numpy.ndarray.std defaults to ddof=0 (no correction). The default behavior of numpy.ndarray.std and torch.Tensor.std is therefore different.
If unbiased is True, Bessel’s correction will be used. Otherwise, the sample deviation is calculated, without any correction.
torch.std(input, dim, unbiased, keepdim=False, *, out=None) → Tensor
Parameters
input (Tensor) – the input tensor.
unbiased (bool) – whether to use Bessel’s correction (δN = 1).
You can try yourself:
>>> a.std(), b.std(unbiased=True), b.std(unbiased=False)
(0.8364538, tensor(0.8942), tensor(0.8365))
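The size of the discrepancy can be reproduced with NumPy alone. The sketch below uses made-up random data and compares the default (ddof=0) standard deviation with the Bessel-corrected one (ddof=1); the two should differ by a constant factor of sqrt(N / (N - 1)), where N is the batch size.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4)).astype(np.float32)   # made-up data, batch size 4

std_population = x.std(axis=1)        # NumPy default: ddof=0, divide by N
std_bessel = x.std(axis=1, ddof=1)    # Bessel's correction: divide by N - 1

n = x.shape[1]
ratio = std_bessel / std_population
print(ratio)                          # every entry should be close to sqrt(n / (n - 1))
print(np.sqrt(n / (n - 1)))
```

With tiny batches the factor is far from 1, which is why the mismatch is so visible in the question. Passing unbiased=False on the PyTorch side, or ddof=1 on the NumPy side, should bring the two results into agreement.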

How to minimize an objective function containing frobenius and nuclear norms?

I want to minimize ||Xs - Ys Ds||_F^2 + lamda1 * ||Ds||_* over Ds, subject to the constraint that the squared Frobenius norm of Ds is less than or equal to 1.
Currently I am using CVXPY library to solve the objective function. My code sample looks like
import cvxpy as cp
import numpy as np
np.random.seed(1)
Xs = np.random.randn(100, 4096)
Ys = np.random.randn(100, 300)
# Define and solve the CVXPY problem.
Ds = cp.Variable(shape=(300, 4096))
lamda1 = 1
obj = cp.Minimize(cp.square(cp.norm(Xs - (Ys @ Ds), "fro")) + lamda1 * cp.norm(Ds, "nuc"))
constraints = [cp.square(cp.norm(Ds, "fro")) <= 1]
prob = cp.Problem(obj, constraints)
prob.solve(solver=cp.SCS, verbose=True)
The console gives the following error:
Solver 'SCS' failed. Try another solver or solve with verbose=True for more information. Try re-centering the problem data around 0 and re-scaling to reduce the dynamic range.
I even tried experimenting with different solvers such as cp.ECOS, but they do not optimize the function either.
Any suggestions?

how to input data for shapiro wilk test using python scipy

I am trying to run a normality test on my data.
# Method 1
import numpy as np
from scipy.stats import shapiro
data = [1874181.6503, 2428393.05517, 2486600.8183,...] # there are 146 data points
data = np.array(data)
stat, p = shapiro(data)
print('statistics=%.3f, p=%.3f' % (stat, p))
alpha = 0.05
if p > alpha:
    print('its Gaussian ')
else:
    print('not Gaussian')
----Output----
statistics=0.582, p=0.000
not Gaussian
When I run it, the result is "not Gaussian". But when I calculate the mean and standard deviation and generate a sample using np.random.normal(mu, sigma, 146) (shown below), the result is Gaussian.
# Method 2
import numpy as np
from scipy.stats import shapiro
data = [1874181.6503, 2428393.05517, 2486600.8183,...] # there are 146 data points
data = np.array(data)
d_mu = np.mean(data)
d_sig = np.std(data)
data = np.random.normal(d_mu,d_sig, 146)
stat, p = shapiro(data)
print('statistics=%.3f, p=%.3f' % (stat, p))
alpha = 0.05
if p > alpha:
    print('its Gaussian ')
else:
    print('not Gaussian')
------ Output ----
statistics=0.987, p=0.212
its Gaussian
I am very new to data analytics. It would be helpful if someone could help me with the questions below:
Which is the right method to run the Shapiro test: Method 1 or Method 2?
I have difficulty understanding the np.random.normal(d_mu, d_sig, 146) function. The definition given in the docs is "Draw random samples from a normal (Gaussian) distribution." But what data sample is it generating? We already have data (my input data), and we have calculated its mean and standard deviation to plot the normal distribution, yet the function returns some other data sample, and my Shapiro test passes for that. (I know I am understanding this completely wrongly, but I am not able to decide which one is right.)
I am trying to apply a normal distribution to time series data. Can anyone suggest helpful docs or links on normality testing and the normal distribution? Anything that guides me in the right direction would help.
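The difference between the two methods can be illustrated with a minimal sketch. The real 146 values are not shown in the question, so the code below uses made-up, strongly skewed data as a stand-in. np.random.normal does not transform the input data; it draws brand-new numbers from a normal distribution with the given mean and standard deviation. Method 2 therefore tests those synthetic numbers, and only Method 1 tests the observed data.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(42)
# Made-up, strongly skewed stand-in for the real measurements (assumption)
observed = rng.exponential(scale=2.0e6, size=146)

# Method 1: test the observed values themselves
w_obs, p_obs = shapiro(observed)

# Method 2: draw new numbers from a normal distribution that merely shares
# the mean and standard deviation of the observed data, then test those
synthetic = rng.normal(observed.mean(), observed.std(), size=146)
w_syn, p_syn = shapiro(synthetic)

print('observed : W=%.3f, p=%.3g' % (w_obs, p_obs))
print('synthetic: W=%.3f, p=%.3g' % (w_syn, p_syn))
```

The synthetic sample is normal by construction, so its test result says nothing about whether the original data are normally distributed.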

how is pandas kurtosis defined?

I am trying to get kurtosis using pandas. By doing some exploration, I have
test_series = pd.Series(np.random.randn(5000))
test_series.kurtosis()
however, the output is:
-0.006755982906479385
But I think the kurtosis (https://en.wikipedia.org/wiki/Kurtosis) should be close to (maybe normalize over N-1 instead of N, but this does not matter here)
(test_series - test_series.mean()).pow(4).mean()/np.power(test_series.std(),4)
which is
2.9908543104146026
The pandas documentation says the following
Return unbiased kurtosis over requested axis using Fisher’s definition of kurtosis (kurtosis of normal == 0.0)
This is probably the excess kurtosis, defined as kurtosis - 3.
Pandas is calculating the UNBIASED estimator of the excess Kurtosis. Kurtosis is the normalized 4th central moment. To find the unbiased estimators of the cumulants you need the k-statistics.
So the unbiased estimator of kurtosis is (k4/k2**2)
To illustrate this:
import pandas as pd
import numpy as np
np.random.seed(11234)
test_series = pd.Series(np.random.randn(5000))
test_series.kurtosis()
#-0.0411811269445872
Now we can calculate this explicitly using the k-statistics:
n = len(test_series)
S1 = test_series.pow(1).sum()
S2 = test_series.pow(2).sum()
S3 = test_series.pow(3).sum()
S4 = test_series.pow(4).sum()
# Eq (7) and (5) from the k-statistics link
k4 = (-6*S1**4 + 12*n*S1**2*S2 - 3*n*(n-1)*S2**2 -4*n*(n+1)*S1*S3 + n**2*(n+1)*S4)/(n*(n-1)*(n-2)*(n-3))
k2 = (n*S2-S1**2)/(n*(n-1))
# k2 is the same as the N-1 variance: test_series.std(ddof=1)**2
k4/k2**2
#-0.04118112694458816
If you want better agreement to more decimal places, you'll need to be careful with the sums as they get rather large. But they're identical to 12 places.
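The same number can also be reached from the plain moment-based excess kurtosis. The sketch below uses freshly generated random data and the variable names g2 and G2, which are choices made here for illustration. It applies the standard small-sample adjustment to g2, which is algebraically equal to k4/k2**2, and compares the result with pandas.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(11234)
s = pd.Series(rng.standard_normal(5000))   # made-up sample
n = len(s)

d = s - s.mean()
m2 = (d ** 2).mean()        # biased (divide-by-n) second central moment
m4 = (d ** 4).mean()        # biased (divide-by-n) fourth central moment
g2 = m4 / m2 ** 2 - 3       # plain excess kurtosis

# Small-sample adjustment of g2; algebraically the same as k4 / k2**2
G2 = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)

print(g2, G2, s.kurtosis())
```

For a sample this large, g2 and G2 differ only slightly; the adjustment matters mostly for small n.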

Categorical features correlation

I have some categorical features in my data along with continuous ones. Is it a good or an absolutely bad idea to one-hot encode the categorical features in order to find their correlation with the labels, along with the other continuous features?
There is a way to calculate the correlation coefficient without one-hot encoding the categorical variable. Cramér's V statistic is one method for measuring the association between categorical variables, and it can be calculated as follows. The following link is helpful: Using pandas, calculate Cramér's coefficient matrix. Continuous variables can be binned into categories with pandas' cut.
import numpy as np
import pandas as pd
import scipy.stats as ss
import seaborn as sns
print('Pandas version:', pd.__version__)
# Pandas version: 1.3.0
tips = sns.load_dataset("tips")
tips["total_bill_cut"] = pd.cut(tips["total_bill"],
                                np.arange(0, 55, 5),
                                include_lowest=True,
                                right=False)
def cramers_v(confusion_matrix):
    """Calculate Cramér's V statistic for categorical-categorical association.
    Uses the bias correction from Bergsma, Wicher (2013),
    Journal of the Korean Statistical Society 42: 323-328.
    """
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))
confusion_matrix = pd.crosstab(tips["day"], tips["time"])
cramers_v(confusion_matrix.values)
# Out[2]: 0.9386619340722221
confusion_matrix = pd.crosstab(tips["total_bill_cut"], tips["time"])
cramers_v(confusion_matrix.values)
# Out[3]: 0.1649870749498837
Please note that .as_matrix() has been deprecated in pandas since version 0.23.0; use .values instead.
I found the phik library quite useful for calculating correlation between categorical and interval features. It is also useful for binning numerical features. Give it a try: phik documentation
I was looking to do the same thing in BigQuery.
For numeric features you can use the built-in CORR(x,y) function.
For categorical features, you can calculate it as:
cardinality(cat1 x cat2) / max(cardinality(cat1), cardinality(cat2)).
This translates to the following SQL:
SELECT
COUNT(DISTINCT(CONCAT(cat1, cat2))) / GREATEST (COUNT(DISTINCT(cat1)), COUNT(DISTINCT(cat2))) as cat1_2,
COUNT(DISTINCT(CONCAT(cat1, cat3))) / GREATEST (COUNT(DISTINCT(cat1)), COUNT(DISTINCT(cat3))) as cat1_3,
....
FROM ...
A higher number means lower correlation.
I used the following Python script to generate the SQL:
import itertools
arr = range(1, 10)
# Columns are assumed to be named cat1 ... cat9, as in the SQL above
query = ',\n'.join(
    'COUNT(DISTINCT(CONCAT(cat{a}, cat{b}))) / GREATEST (COUNT(DISTINCT(cat{a})), COUNT(DISTINCT(cat{b}))) as cat{a}_{b}'.format(a=a, b=b)
    for (a, b) in itertools.combinations(arr, 2))
query = 'SELECT \n ' + query + '\n FROM `...`;'
print(query)
It should be straightforward to do the same thing in numpy.
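For reference, the same cardinality ratio could be sketched in pandas roughly as follows. The helper name cardinality_ratio and the toy data are assumptions made for illustration. Distinct pairs are counted with drop_duplicates on the two columns rather than by concatenating strings.

```python
import itertools
import pandas as pd

def cardinality_ratio(df, a, b):
    """Distinct (a, b) pairs divided by the larger of the two column cardinalities."""
    n_pairs = len(df[[a, b]].drop_duplicates())
    return n_pairs / max(df[a].nunique(), df[b].nunique())

# Made-up toy data: cat2 is fully determined by cat1, cat3 varies independently of it
df = pd.DataFrame({
    'cat1': ['x', 'x', 'y', 'y'],
    'cat2': ['p', 'p', 'q', 'q'],
    'cat3': ['p', 'q', 'p', 'q'],
})

ratios = {(a, b): cardinality_ratio(df, a, b)
          for a, b in itertools.combinations(df.columns, 2)}
print(ratios)
```

A ratio of 1 means one column determines the other; larger values indicate weaker association, matching the interpretation given for the SQL version.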