Bayesian IRT Pymc3 - Parameter inference - bayesian

I would like to estimate IRT model using PyMC3.
I generated data with the following distribution:
alpha_fix = 4
beta_fix = 100
theta= np.random.normal(100,15,1000)
prob = np.exp(alpha_fix*(theta-beta_fix))/(1+np.exp(alpha_fix*(theta-beta_fix)))
prob_tt = tt._shared(prob)
Then I created a model using PyMC3 to infer the parameter:
irt = pm.Model()
with irt:
# Priors
alpha = pm.Normal('alpha',mu = 4 , tau = 1)
beta = pm.Normal('beta',mu = 100 , tau = 15)
thau = pm.Normal('thau' ,mu = 100 , tau = 15)
# Modelling
p = pm.Deterministic('p',tt.exp(alpha*(thau-beta))/(1+tt.exp(alpha*(thau-beta))))
out = pm.Normal('o',p,observed = prob_tt)
Then I infer through the model:
with irt:
mean_field = pm.fit(10000,method='advi', callbacks=[pm.callbacks.CheckParametersConvergence(diff='absolute')])
Finally, Sample from the model to get compute posterior:
pm.plot_posterior(mean_field.sample(1000), color='LightSeaGreen');
But the results of the "alpha" (mean of 2.2) is relatively far from the expected one (4) even though the prior on alpha was well-calibrated.
Would you have an idea of the origin of this gap and how to fix it?
Thanks a lot,

out = pm.Normal('o',p,observed = prob_tt)
Why you are using Normal instead of Bernoulli ? Also, what is the variance of normal ?

Related

How to set "budget" tag for xgboost hyperband optimization with mlr3tuningspaces?

I am trying to tune xgboost with hyperband and I would like to use the suggested default tuning space from the mlr3tuningspaces package. However, I don't find how to tag a hyperparameter with "budget" while using lts .
Below, I reproduced the mlr3hyperband package example to illustrate my issue:
library(mlr3verse)
library(mlr3hyperband)
library(mlr3tuningspaces)
## this does not work, because I don't know how to tag a hyperparameter
## with "budget" while using the suggested tuning space
search_space = lts("classif.xgboost.default")
search_space$values
## this works because it has a hyperparameter (nrounds) tagged with "bugdget"
search_space = ps(
nrounds = p_int(lower = 1, upper = 16, tags = "budget"),
eta = p_dbl(lower = 0, upper = 1),
booster = p_fct(levels = c("gbtree", "gblinear", "dart"))
)
# hyperparameter tuning on the pima indians diabetes data set
instance = tune(
method = "hyperband",
task = tsk("pima"),
learner = lrn("classif.xgboost", eval_metric = "logloss"),
resampling = rsmp("cv", folds = 3),
measures = msr("classif.ce"),
search_space = search_space,
term_evals = 100
)
# best performing hyperparameter configuration
instance$result
Thanks for pointing this out. I will add the budget tag to the default search space. Until then you can use this code.
library(mlr3hyperband)
library(mlr3tuningspaces)
library(mlr3learners)
# get learner with search space in one go
learner = lts(lrn("classif.xgboost"))
# overwrite nrounds with budget tag
learner$param_set$values$nrounds = to_tune(p_int(1000, 5000, tags = "budget"))
instance = tune(
method = "hyperband",
task = tsk("pima"),
learner = learner,
resampling = rsmp("cv", folds = 3),
measures = msr("classif.ce"),
term_evals = 100
)
Update 28.06.2022
The new API in version 0.3.0 is
learner = lts(lrn("classif.xgboost"), nrounds = to_tune(p_int(1000, 5000, tags = "budget"))

Is non-identical not enough to be considered 'distinct' for kmeans centroids?

I have an issue with kmeans clustering providing centroids. I saw the same problem already asked (
K-means: Initial centers are not distinct), but the solution in that post is not working in my case.
I selected the centroids using ClusterR::Kmeans_arma. I confirmed that my centroids are not identical using mgcv::uniquecombs, but still got the initial centers are not distinct error.
> dim(t(dat))
[1] 13540 11553
> centroids = ClusterR::KMeans_arma(data = t(dat), centers = 561,
n_iter = 50, seed_mode = "random_subset",
verbose = FALSE, CENTROIDS = NULL)
> dim(centroids)
[1] 561 11553
> x = mgcv::uniquecombs(centroids)
> dim(x)
[1] 561 11553
> res = kmeans(t(dat), centers = centroids, iter.max = 200)
Error in kmeans(t(dat), centers = centroids, iter.max = 200) :
initial centers are not distinct
Any suggestion to resolve this? Thanks!
I replicated the issue you've mentioned with the following data:
cols = 13540
rows = 11553
set.seed(1)
vec_dat = runif(rows * cols)
dat = matrix(vec_dat, nrow = rows, ncol = cols)
dim(dat)
dat = t(dat)
dim(dat)
There is no 'centers' parameter in the 'ClusterR::KMeans_arma()' function, therefore I've assumed you actually mean 'clusters',
centroids = ClusterR::KMeans_arma(data = dat,
clusters = 561,
n_iter = 50,
seed_mode = "random_subset",
verbose = TRUE,
CENTROIDS = NULL)
str(centroids)
dim(centroids)
The 'centroids' is a matrix of class "k-means clustering". If your intention is to come to the clusters then you can use,
clust = ClusterR::predict_KMeans(data = dat,
CENTROIDS = centroids,
threads = 6)
length(unique(clust)) # 561
class(centroids) # "k-means clustering"
If you want to pass the 'centroids' to the base R 'kmeans' function you have to set the 'class' of the 'centroids' object to NULL and that because the base R 'kmeans' function uses internally the base R 'duplicated()' function (you can view this by using print(kmeans) in the R console) which does not recognize the 'centroids' object as a matrix or data.frame (it is an object of class "k-means clustering") and performs the checking column-wise rather than row-wise. Therefore, the following should work for your case,
class(centroids) = NULL
dups = duplicated(centroids)
sum(dups) # this should actually give 0
res = kmeans(dat, centers = centroids, iter.max = 200)
I've made a few adjustments to the "ClusterR::predict_KMeans()" and particularly I've added the "threads" parameter and a check for duplicates, therefore if you want to come to the clusters using multiple cores you have to install the package from Github using,
remotes::install_github('mlampros/ClusterR',
upgrade = 'always',
dependencies = TRUE,
repos = 'https://cloud.r-project.org/')
The changes will take effect in the next version of the CRAN package which will be "1.2.2"
UPDATE regarding output and performance (based on your comment):
data(dietary_survey_IBS, package = 'ClusterR')
kmeans_arma = function(data) {
km_cl = ClusterR::KMeans_arma(data,
clusters = 2,
n_iter = 10,
seed_mode = "random_subset",
seed = 1)
pred_cl = ClusterR::predict_KMeans(data = data,
CENTROIDS = km_cl,
threads = 1)
return(pred_cl)
}
km_arma = kmeans_arma(data = dietary_survey_IBS)
km_algos = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen")
for (algo in km_algos) {
cat('base-kmeans-algo:', algo, '\n')
km_base = kmeans(dietary_survey_IBS,
centers = 2,
iter.max = 10,
nstart = 1, # can be set to 5 or 10 etc.
algorithm = algo)
km_cl = as.vector(km_base$cluster)
print(table(km_arma, km_cl))
cat('--------------------------\n')
}
microbenchmark::microbenchmark(kmeans(dietary_survey_IBS,
centers = 2,
iter.max = 10,
nstart = 1, # can be set to 5 or 10 etc.
algorithm = algo), kmeans_arma(data = dietary_survey_IBS), times = 100)
I don't see any significant difference in the output clusters between the 'base R kmeans' and the 'kmeans_arma' function for all available 'base R kmeans' algorithms (you can test it also for your own data sets). I am not sure which algorithm the 'armadillo' library uses internally and moreover the 'base R kmeans' includes the 'nstart' parameter (you can consult the documentation for more info). Regarding performance you won't see any substantial differences for small to medium data sets but due to the fact that the armadillo library uses OpenMP internally in case that your computer has more than 1 cores then for big data sets I think the 'ClusterR::KMeans_arma' function will return the 'centroids' faster.

A Simple Bayesian Network with a Coin-Flipping Problem

I am trying to implement a Bayesian network and solve a regression problem using PYMC3. In my model, I have a fair coin as the parent node. If the parent node is H, the child node selects the normal distribution N(5,0.2); if T, the child selects N(0,0.5). Here is an illustration of my network.
To simulate this network, I generated a sample dataset and tried doing Bayesian regression using the code below. Currently, the model does regression only for the child node as if the parent node does not exist. I would greatly appreciate it if anyone can let me know how to implement the conditional probability P(D|C). Ultimately, I am interested in finding the probability distribution for mu1 and mu2. Thank you!
# Generate data for coin flip P(C) and store in c1
theta_real = 0.5 # unkown value in a real experiment
n_sample = 10
c1 = bernoulli.rvs(p=theta_real, size=n_sample)
# Generate data for normal distribution P(D|C) and store in d1
np.random.seed(123)
mu1 = 0
sigma1 = 0.5
mu2 = 5
sigma2 = 0.2
d1 = []
for index, item in enumerate(c1):
if item == 0:
d1.extend(normal(mu1, sigma1, 1))
else:
d1.extend(normal(mu2, sigma2, 1))
# I start building PYMC3 model here
c1_tensor = theano.shared(np.array(c1))
d1_tensor = theano.shared(np.array(d1))
with pm.Model() as model:
# define prior for c1. I am not sure how to do this.
#c1_present = pm.Categorical('c1',observed=c1_tensor)
# how do I incorporate P(D | C)
mu_prior = pm.Normal('mu', mu=2, sd=2, shape=1)
sigma_prior = pm.HalfNormal('sigma', sd=2, shape=1)
y_likelihood = pm.Normal('y', mu=mu_prior, sd=sigma_prior, observed=d1_tensor)
You could use the Dirichlet distribution as a prior for the coin toss and NormalMixture as the prior of the two Gaussians. In the following snippet I changed the fairness of the coin and increased the number of coin tosses, but you could adjust these in any way want:
import numpy as np
import pymc3 as pm
from scipy.stats import bernoulli
# Generate data for coin flip P(C) and store in c1
theta_real = 0.2 # unkown value in a real experiment
n_sample = 2000
c1 = bernoulli.rvs(p=theta_real, size=n_sample)
# Generate data for normal distribution P(D|C) and store in d1
np.random.seed(123)
mu1 = 0
sigma1 = 0.5
mu2 = 5
sigma2 = 0.2
d1 = []
for index, item in enumerate(c1):
if item == 0:
d1.extend(np.random.normal(mu1, sigma1, 1))
else:
d1.extend(np.random.normal(mu2, sigma2, 1))
with pm.Model() as model:
w = pm.Dirichlet('p', a=np.ones(2))
mu = pm.Normal('mu', 0, 20, shape=2)
sigma = np.array([0.5,0.2])
pm.NormalMixture('like',w=w,mu=mu,sigma=sigma,observed=np.array(d1))
trace = pm.sample()
pm.summary(trace)
This will give you the following:
mean sd mc_error hpd_2.5 hpd_97.5 n_eff Rhat
mu__0 4.981222 0.023900 0.000491 4.935044 5.027420 2643.052184 0.999637
mu__1 -0.007660 0.004946 0.000095 -0.017388 0.001576 2481.146286 1.000312
p__0 0.213976 0.009393 0.000167 0.195602 0.231803 2245.905021 0.999302
p__1 0.786024 0.009393 0.000167 0.768197 0.804398 2245.905021 0.999302
The parameters are recovered nicely as you can also see from the traceplots:
The above implementation will give you the posterior of theta_real, mu1 and mu2 but I could not get convergence when I added sigma1 and sigma2 as parameters to be estimated by the data (even though the prior was quite narrow):
with pm.Model() as model:
w = pm.Dirichlet('p', a=np.ones(2))
mu = pm.Normal('mu', 0, 20, shape=2)
sigma = pm.HalfNormal('sigma', sd=2, shape=2)
pm.NormalMixture('like',w=w,mu=mu,sigma=sigma,observed=np.array(d1))
trace = pm.sample()
print(pm.summary(trace))
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma, mu, p]
Sampling 4 chains: 100%|██████████| 4000/4000 [00:10<00:00, 395.57draws/s]
The acceptance probability does not match the target. It is 0.883057127209148, but should be close to 0.8. Try to increase the number of tuning steps.
The gelman-rubin statistic is larger than 1.4 for some parameters. The sampler did not converge.
The estimated number of effective samples is smaller than 200 for some parameters.
mean sd mc_error ... hpd_97.5 n_eff Rhat
mu__0 1.244021 2.165433 0.216540 ... 5.005507 2.002049 212.596596
mu__1 3.743879 2.165122 0.216510 ... 5.012067 2.002040 235.750129
p__0 0.643069 0.248630 0.024846 ... 0.803369 2.004185 30.966189
p__1 0.356931 0.248630 0.024846 ... 0.798632 2.004185 30.966189
sigma__0 0.416207 0.125435 0.012517 ... 0.504110 2.009031 17.333177
sigma__1 0.271763 0.125539 0.012533 ... 0.497208 2.007779 19.217223
[6 rows x 7 columns]
Based on that you most likely will need to reparametrize if you also wanted to estimate the two standard deviations from this data.
This answer is to supplement #balleveryday's answer, which suggests the Gaussian Mixture Model, but had some trouble getting the symmetry breaking to work. Admittedly, the symmetry breaking in the official example is done in the context of Metropolis-Hastings sampling, whereas I think NUTS might be a little more sensitive to encountering impossible values (not sure). Here's what worked for me:
import numpy as np
import pymc3 as pm
from scipy.stats import bernoulli
import theano.tensor as tt
# everything should reproduce
np.random.seed(123)
n_sample = 2000
# Generate data for coin flip P(C) and store in c1
theta_real = 0.2 # unknown value in a real experiment
c1 = bernoulli.rvs(p=theta_real, size=n_sample)
# Generate data for normal distribution P(D|C) and store in d1
mu1, mu2 = 0, 5
sigma1, sigma2 = 0.5, 0.2
d1 = np.empty_like(c1, dtype=np.float64)
d1[c1 == 0] = np.random.normal(mu1, sigma1, np.sum(c1 == 0))
d1[c1 == 1] = np.random.normal(mu2, sigma2, np.sum(c1 == 1))
with pm.Model() as gmm_asym:
# mixture vector
w = pm.Dirichlet('p', a=np.ones(2))
# Gaussian parameters (testval helps start off ordered)
mu = pm.Normal('mu', 0, 20, shape=2, testval=[-10, 10])
sigma = pm.HalfNormal('sigma', sd=2, shape=2)
# break symmetry, forcing mu[0] < mu[1]
order_means_potential = pm.Potential('order_means_potential',
tt.switch(mu[1] - mu[0] < 0, -np.inf, 0))
# observed
pm.NormalMixture('like', w=w, mu=mu, sigma=sigma, observed=d1)
# reproducible sampling
tr_gmm_asym = pm.sample(tune=2000, target_accept=0.9, random_seed=20191121)
This produces samples with the statistics
mean sd mc_error hpd_2.5 hpd_97.5 n_eff Rhat
mu__0 0.004549 0.011975 0.000226 -0.017398 0.029375 2425.487301 0.999916
mu__1 5.007663 0.008993 0.000166 4.989247 5.024692 2181.134002 0.999563
p__0 0.789983 0.009091 0.000188 0.773059 0.808062 2417.356539 0.999788
p__1 0.210017 0.009091 0.000188 0.191938 0.226941 2417.356539 0.999788
sigma__0 0.497322 0.009103 0.000186 0.480394 0.515867 2227.397854 0.999358
sigma__1 0.191310 0.006633 0.000141 0.178924 0.204859 2286.817037 0.999614
and the traces

Error in prediction in svm classifier after one hot encoding

I have used one hot encoding to my dataset before training my SVM classifier.
which increased number of features in training set to 982. But during
prediction of test dataset which has 7 features i am getting error " X has 7
features per sample; expecting 982". I don't understand how to increase
number of features in test dataset.
My code is:
df = pd.read_csv('train.csv',header=None);
features = df.iloc[:,:-1].values
labels = df.iloc[:,-1].values
encode = LabelEncoder()
features[:,2] = encode.fit_transform(features[:,2])
features[:,3] = encode.fit_transform(features[:,3])
features[:,4] = encode.fit_transform(features[:,4])
features[:,5] = encode.fit_transform(features[:,5])
df1 = pd.DataFrame(features)
#--------------------------- ONE HOT ENCODING --------------------------------#
hotencode = OneHotEncoder(categorical_features=[2])
features = hotencode.fit_transform(features).toarray()
hotencode = OneHotEncoder(categorical_features=[14])
features = hotencode.fit_transform(features).toarray()
hotencode = OneHotEncoder(categorical_features=[37])
features = hotencode.fit_transform(features).toarray()
hotencode = OneHotEncoder(categorical_features=[466])
features = hotencode.fit_transform(features).toarray()
X = np.array(features)
y = np.array(labels)
clf = svm.LinearSVC()
clf.fit(X,y)
d_test = pd.read_csv('query.csv')
Z_test =np.array(d_test)
confidence = clf.predict(Z_test)
print("The query image belongs to Class ")
print(confidence)
######################### test dataset
query.csv
1 0.076 1 3232236298 2886732679 3128 60604
The short answer: you need to apply the same OHE transform (or LE+OHE in your case) on the test set.
For a good advice, see Scikit Learn OneHotEncoder fit and transform Error: ValueError: X has different shape than during fitting or How to deal with imputation and hot one encoding in pandas?

Checking the gradient when doing gradient descent

I'm trying to implement a feed-forward backpropagating autoencoder (training with gradient descent) and wanted to verify that I'm calculating the gradient correctly. This tutorial suggests calculating the derivative of each parameter one at a time: grad_i(theta) = (J(theta_i+epsilon) - J(theta_i-epsilon)) / (2*epsilon). I've written a sample piece of code in Matlab to do just this, but without much luck -- the differences between the gradient calculated from the derivative and the gradient numerically found tend to be largish (>> 4 significant figures).
If anyone can offer any suggestions, I would greatly appreciate the help (either in my calculation of the gradient or how I perform the check). Because I've simplified the code greatly to make it more readable, I haven't included a biases, and am no longer tying the weight matrices.
First, I initialize the variables:
numHidden = 200;
numVisible = 784;
low = -4*sqrt(6./(numHidden + numVisible));
high = 4*sqrt(6./(numHidden + numVisible));
encoder = low + (high-low)*rand(numVisible, numHidden);
decoder = low + (high-low)*rand(numHidden, numVisible);
Next, given some input image x, do feed-forward propagation:
a = sigmoid(x*encoder);
z = sigmoid(a*decoder); % (reconstruction of x)
The loss function I'm using is the standard Σ(0.5*(z - x)^2)):
% first calculate the error by finding the derivative of sum(0.5*(z-x).^2),
% which is (f(h)-x)*f'(h), where z = f(h), h = a*decoder, and
% f = sigmoid(x). However, since the derivative of the sigmoid is
% sigmoid*(1 - sigmoid), we get:
error_0 = (z - x).*z.*(1-z);
% The gradient \Delta w_{ji} = error_j*a_i
gDecoder = error_0'*a;
% not important, but included for completeness
% do back-propagation one layer down
error_1 = (error_0*encoder).*a.*(1-a);
gEncoder = error_1'*x;
And finally, check that the gradient is correct (in this case, just do it for the decoder):
epsilon = 10e-5;
check = gDecoder(:); % the values we obtained above
for i = 1:size(decoder(:), 1)
% calculate J+
theta = decoder(:); % unroll
theta(i) = theta(i) + epsilon;
decoderp = reshape(theta, size(decoder)); % re-roll
a = sigmoid(x*encoder);
z = sigmoid(a*decoderp);
Jp = sum(0.5*(z - x).^2);
% calculate J-
theta = decoder(:);
theta(i) = theta(i) - epsilon;
decoderp = reshape(theta, size(decoder));
a = sigmoid(x*encoder);
z = sigmoid(a*decoderp);
Jm = sum(0.5*(z - x).^2);
grad_i = (Jp - Jm) / (2*epsilon);
diff = abs(grad_i - check(i));
fprintf('%d: %f <=> %f: %f\n', i, grad_i, check(i), diff);
end
Running this on the MNIST dataset (for the first entry) gives results such as:
2: 0.093885 <=> 0.028398: 0.065487
3: 0.066285 <=> 0.031096: 0.035189
5: 0.053074 <=> 0.019839: 0.033235
6: 0.108249 <=> 0.042407: 0.065843
7: 0.091576 <=> 0.009014: 0.082562
Do not sigmoid on both a and z. Just use it on z.
a = x*encoder;
z = sigmoid(a*decoderp);