Multiple paired Wilcoxon signed-rank tests between vectors, with medians of differences, confidence intervals and p-values compiled in an R data frame?

I have spent several hours trying to find a solution, without success. Does anyone know how to apply a paired Wilcoxon signed-rank test (paired = TRUE) of v2, v3 and v4 versus v1 (so v1 is the reference vector) such that the results are compiled in an R data frame like this:
# groups     (pseudo)median   conf.int_low   conf.int_high   p.value
# v1 vs v2    6.41             4.645          8.245          1.335e-05
# v1 vs v3   21.2875          20.270         22.125          1.907e-06
# v1 vs v4   -1.899768        -2.725023      -1.349986       9.542e-05
The above results come from:
wilcox.test(df$v1, df$v2, paired = TRUE, conf.int = TRUE)
wilcox.test(df$v1, df$v3, paired = TRUE, conf.int = TRUE)
wilcox.test(df$v1, df$v4, paired = TRUE, conf.int = TRUE)
The data are:
df <-
structure(list(
v1 = c(280, 237.48, 235.7, 250.3, 242.9, 244.76, 245.74, 244.4, 246.24, 242.3, 239.64, 245.88, 247, 247.7, 242.86, 244.99, 234.52, 241.9, 244.99, 221.85),
v2 = c(284.39, 231.79, 226.53, 250.2, 237.05, 237.05, 239.68, 239.68, 237.05, 237.05, 234.42, 239.68, 231.79, 244.94, 231.79, 237.05, 226.53, 239.68, 239.68, 208.12),
v3 = c(256.43, 215.81, 215.86, 231.35, 221.7, 222.51, 222.7, 222.26, 225.59, 220.95, 218.43, 221.14, 224.53, 225.68, 224.79, 222.54, 215.08, 219.96, 225.4, 210.99),
v4 = c(282.85, 238.43, 239.2, 252.75, 243.8, 245.81, 247.14, 246.3, 247.09, 243.85, 240.79, 247.18, 248.95, 248.6, 246.41, 247.04, 241.42, 242.4, 247.44, 234.35)),
row.names = c(NA, -20L), class = "data.frame")
Thanks for help!
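To make the "(pseudo)median" column concrete, here is a minimal sketch in plain Python (not R) of the Hodges-Lehmann estimate, that is, the median of the Walsh averages of the paired differences, together with a loop that compiles one row per comparison against a reference vector. The function names `pseudomedian` and `compare_to_reference` and the toy data are hypothetical. R's `wilcox.test` documents this estimator only for the case where exact p-values are available, so values for data with ties may differ slightly from this sketch. Confidence intervals and p-values are not computed here.

```python
from itertools import combinations_with_replacement
from statistics import median


def pseudomedian(x, y):
    """Hodges-Lehmann estimate: median of the Walsh averages of x - y."""
    diffs = [a - b for a, b in zip(x, y)]
    # Walsh averages: (d_i + d_j) / 2 for all pairs with i <= j.
    walsh = [(d1 + d2) / 2 for d1, d2 in combinations_with_replacement(diffs, 2)]
    return median(walsh)


def compare_to_reference(columns, reference):
    """Return one result row per non-reference column, in insertion order."""
    ref = columns[reference]
    return [
        {"groups": f"{reference} vs {name}", "pseudomedian": pseudomedian(ref, values)}
        for name, values in columns.items()
        if name != reference
    ]


# Toy data (hypothetical, not the data from the question).
toy = {"v1": [5.0, 7.0, 9.0], "v2": [4.0, 5.0, 6.0], "v3": [6.0, 8.0, 10.0]}
for row in compare_to_reference(toy, "v1"):
    print(row)
```

The same pattern (loop over the non-reference columns, collect one record per test, bind the records into a table) carries over to R, where each record would come from the corresponding `wilcox.test` result.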

Bayesian IRT Pymc3 - Parameter inference

I would like to estimate IRT model using PyMC3.
I generated data with the following distribution:
import numpy as np
import pymc3 as pm
import theano.tensor as tt

alpha_fix = 4
beta_fix = 100
theta= np.random.normal(100,15,1000)
prob = np.exp(alpha_fix*(theta-beta_fix))/(1+np.exp(alpha_fix*(theta-beta_fix)))
prob_tt = tt._shared(prob)
Then I created a model using PyMC3 to infer the parameter:
irt = pm.Model()
with irt:
    # Priors
    alpha = pm.Normal('alpha',mu = 4 , tau = 1)
    beta = pm.Normal('beta',mu = 100 , tau = 15)
    thau = pm.Normal('thau' ,mu = 100 , tau = 15)
    # Modelling
    p = pm.Deterministic('p',tt.exp(alpha*(thau-beta))/(1+tt.exp(alpha*(thau-beta))))
    out = pm.Normal('o',p,observed = prob_tt)
Then I fit the model with ADVI:
with irt:
    mean_field = pm.fit(10000,method='advi', callbacks=[pm.callbacks.CheckParametersConvergence(diff='absolute')])
Finally, I sample from the fitted approximation to get the posterior:
pm.plot_posterior(mean_field.sample(1000), color='LightSeaGreen');
However, the estimate of "alpha" (mean of 2.2) is relatively far from the expected value (4), even though the prior on alpha was well calibrated.
Do you have an idea where this gap comes from and how to fix it?
Thanks a lot,
out = pm.Normal('o',p,observed = prob_tt)
Why are you using a Normal likelihood instead of a Bernoulli? Also, what is the variance of the Normal?
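To illustrate what a Bernoulli setup would mean on the data side, here is a minimal sketch in plain Python that draws 0/1 responses from the same logistic curve instead of treating the probabilities themselves as observations. The helper names `sigmoid` and `simulate_responses` are hypothetical; in PyMC3 the matching likelihood would be `pm.Bernoulli` with `p` as its probability. Note that with alpha = 4 and abilities spread with a standard deviation of 15, the curve is very steep, so most probabilities are very close to 0 or 1.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def simulate_responses(thetas, alpha, beta, rng):
    """Draw one 0/1 response per ability value under a single logistic item."""
    return [1 if rng.random() < sigmoid(alpha * (t - beta)) else 0 for t in thetas]


rng = random.Random(0)  # fixed seed so the draw is reproducible
thetas = [rng.gauss(100, 15) for _ in range(1000)]
responses = simulate_responses(thetas, alpha=4, beta=100, rng=rng)
print(sum(responses) / len(responses))  # share of positive responses
```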

Is non-identical not enough to be considered 'distinct' for kmeans centroids?

I have an issue with passing initial centroids to kmeans clustering. The same problem has already been asked (K-means: Initial centers are not distinct), but the solution in that post does not work in my case.
I selected the centroids using ClusterR::KMeans_arma. I confirmed that my centroids are not identical using mgcv::uniquecombs, but I still got the "initial centers are not distinct" error.
> dim(t(dat))
[1] 13540 11553
> centroids = ClusterR::KMeans_arma(data = t(dat), centers = 561,
n_iter = 50, seed_mode = "random_subset",
verbose = FALSE, CENTROIDS = NULL)
> dim(centroids)
[1] 561 11553
> x = mgcv::uniquecombs(centroids)
> dim(x)
[1] 561 11553
> res = kmeans(t(dat), centers = centroids, iter.max = 200)
Error in kmeans(t(dat), centers = centroids, iter.max = 200) :
initial centers are not distinct
Any suggestion to resolve this? Thanks!
I replicated the issue you've mentioned with the following data:
cols = 13540
rows = 11553
set.seed(1)
vec_dat = runif(rows * cols)
dat = matrix(vec_dat, nrow = rows, ncol = cols)
dim(dat)
dat = t(dat)
dim(dat)
There is no 'centers' parameter in the 'ClusterR::KMeans_arma()' function, so I've assumed you actually mean 'clusters':
centroids = ClusterR::KMeans_arma(data = dat,
clusters = 561,
n_iter = 50,
seed_mode = "random_subset",
verbose = TRUE,
CENTROIDS = NULL)
str(centroids)
dim(centroids)
The 'centroids' object is a matrix of class "k-means clustering". If your intention is to obtain the cluster assignments, you can use:
clust = ClusterR::predict_KMeans(data = dat,
CENTROIDS = centroids,
threads = 6)
length(unique(clust)) # 561
class(centroids) # "k-means clustering"
If you want to pass 'centroids' to the base R 'kmeans' function, you have to set the 'class' of the 'centroids' object to NULL. The base R 'kmeans' function internally calls the base R 'duplicated()' function (you can see this with print(kmeans) in the R console), which does not recognize the 'centroids' object as a matrix or data.frame (it is an object of class "k-means clustering") and so performs the check column-wise rather than row-wise. Therefore, the following should work in your case:
class(centroids) = NULL
dups = duplicated(centroids)
sum(dups) # this should actually give 0
res = kmeans(dat, centers = centroids, iter.max = 200)
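The difference between a row-wise duplicate check and one that ignores the row structure can be sketched in a few lines of Python (used here only for illustration; the function names and the tiny matrix are hypothetical, and whether R's check on the classed object behaves exactly like the second function is an assumption):

```python
def has_duplicate_rows(matrix):
    """True if two rows are identical (what 'distinct centers' should mean)."""
    seen = set()
    for row in matrix:
        key = tuple(row)
        if key in seen:
            return True
        seen.add(key)
    return False


def has_duplicate_elements(matrix):
    """True if any single value repeats anywhere, ignoring the row structure."""
    flat = [value for row in matrix for value in row]
    return len(set(flat)) < len(flat)


# Two distinct centers that nevertheless share individual values.
centers = [[0.0, 1.0], [1.0, 0.0]]
print(has_duplicate_rows(centers))      # False
print(has_duplicate_elements(centers))  # True
```

A check of the second kind reports duplicates even though every center is distinct, which is consistent with the error appearing despite the uniquecombs result.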
I've made a few adjustments to "ClusterR::predict_KMeans()", in particular adding the "threads" parameter and a check for duplicates. If you want to compute the clusters using multiple cores, you have to install the package from GitHub using:
remotes::install_github('mlampros/ClusterR',
upgrade = 'always',
dependencies = TRUE,
repos = 'https://cloud.r-project.org/')
The changes will take effect in the next version of the CRAN package, which will be "1.2.2".
UPDATE regarding output and performance (based on your comment):
data(dietary_survey_IBS, package = 'ClusterR')
kmeans_arma = function(data) {
  km_cl = ClusterR::KMeans_arma(data,
                                clusters = 2,
                                n_iter = 10,
                                seed_mode = "random_subset",
                                seed = 1)
  pred_cl = ClusterR::predict_KMeans(data = data,
                                     CENTROIDS = km_cl,
                                     threads = 1)
  return(pred_cl)
}
km_arma = kmeans_arma(data = dietary_survey_IBS)
km_algos = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen")
for (algo in km_algos) {
  cat('base-kmeans-algo:', algo, '\n')
  km_base = kmeans(dietary_survey_IBS,
                   centers = 2,
                   iter.max = 10,
                   nstart = 1, # can be set to 5 or 10 etc.
                   algorithm = algo)
  km_cl = as.vector(km_base$cluster)
  print(table(km_arma, km_cl))
  cat('--------------------------\n')
}
microbenchmark::microbenchmark(
  kmeans(dietary_survey_IBS,
         centers = 2,
         iter.max = 10,
         nstart = 1, # can be set to 5 or 10 etc.
         algorithm = algo),
  kmeans_arma(data = dietary_survey_IBS),
  times = 100)
I don't see any significant difference in the output clusters between base R 'kmeans' and the 'kmeans_arma' function for any of the available base R 'kmeans' algorithms (you can also test this on your own data sets). I am not sure which algorithm the 'armadillo' library uses internally; moreover, base R 'kmeans' includes the 'nstart' parameter (consult the documentation for more information). Regarding performance, you won't see substantial differences for small to medium data sets. However, because the armadillo library uses OpenMP internally, on a computer with more than one core I think 'ClusterR::KMeans_arma' will return the 'centroids' faster for big data sets.

A Simple Bayesian Network with a Coin-Flipping Problem

I am trying to implement a Bayesian network and solve a regression problem using PyMC3. In my model, I have a fair coin as the parent node. If the parent node is H, the child node selects the normal distribution N(5,0.2); if T, the child selects N(0,0.5). Here is an illustration of my network.
To simulate this network, I generated a sample dataset and tried doing Bayesian regression using the code below. Currently, the model does regression only for the child node, as if the parent node did not exist. I would greatly appreciate it if anyone could tell me how to implement the conditional probability P(D|C). Ultimately, I am interested in finding the probability distributions for mu1 and mu2. Thank you!
import numpy as np
import pymc3 as pm
import theano
from numpy.random import normal
from scipy.stats import bernoulli

# Generate data for coin flip P(C) and store in c1
theta_real = 0.5 # unknown value in a real experiment
n_sample = 10
c1 = bernoulli.rvs(p=theta_real, size=n_sample)
# Generate data for normal distribution P(D|C) and store in d1
np.random.seed(123)
mu1 = 0
sigma1 = 0.5
mu2 = 5
sigma2 = 0.2
d1 = []
for index, item in enumerate(c1):
    if item == 0:
        d1.extend(normal(mu1, sigma1, 1))
    else:
        d1.extend(normal(mu2, sigma2, 1))
# I start building PYMC3 model here
c1_tensor = theano.shared(np.array(c1))
d1_tensor = theano.shared(np.array(d1))
with pm.Model() as model:
    # define prior for c1. I am not sure how to do this.
    #c1_present = pm.Categorical('c1',observed=c1_tensor)
    # how do I incorporate P(D | C)
    mu_prior = pm.Normal('mu', mu=2, sd=2, shape=1)
    sigma_prior = pm.HalfNormal('sigma', sd=2, shape=1)
    y_likelihood = pm.Normal('y', mu=mu_prior, sd=sigma_prior, observed=d1_tensor)
You could use the Dirichlet distribution as a prior for the coin toss and NormalMixture as the likelihood for the two Gaussians. In the following snippet I changed the fairness of the coin and increased the number of coin tosses, but you could adjust these in any way you want:
import numpy as np
import pymc3 as pm
from scipy.stats import bernoulli
# Generate data for coin flip P(C) and store in c1
theta_real = 0.2 # unknown value in a real experiment
n_sample = 2000
c1 = bernoulli.rvs(p=theta_real, size=n_sample)
# Generate data for normal distribution P(D|C) and store in d1
np.random.seed(123)
mu1 = 0
sigma1 = 0.5
mu2 = 5
sigma2 = 0.2
d1 = []
for index, item in enumerate(c1):
    if item == 0:
        d1.extend(np.random.normal(mu1, sigma1, 1))
    else:
        d1.extend(np.random.normal(mu2, sigma2, 1))
with pm.Model() as model:
    w = pm.Dirichlet('p', a=np.ones(2))
    mu = pm.Normal('mu', 0, 20, shape=2)
    sigma = np.array([0.5,0.2])
    pm.NormalMixture('like',w=w,mu=mu,sigma=sigma,observed=np.array(d1))
    trace = pm.sample()
pm.summary(trace)
This will give you the following:
            mean        sd  mc_error    hpd_2.5  hpd_97.5        n_eff      Rhat
mu__0   4.981222  0.023900  0.000491   4.935044  5.027420  2643.052184  0.999637
mu__1  -0.007660  0.004946  0.000095  -0.017388  0.001576  2481.146286  1.000312
p__0    0.213976  0.009393  0.000167   0.195602  0.231803  2245.905021  0.999302
p__1    0.786024  0.009393  0.000167   0.768197  0.804398  2245.905021  0.999302
The parameters are recovered nicely as you can also see from the traceplots:
The above implementation gives you the posteriors of theta_real, mu1 and mu2, but I could not get convergence when I added sigma1 and sigma2 as parameters to be estimated from the data (even though the prior was quite narrow):
with pm.Model() as model:
    w = pm.Dirichlet('p', a=np.ones(2))
    mu = pm.Normal('mu', 0, 20, shape=2)
    sigma = pm.HalfNormal('sigma', sd=2, shape=2)
    pm.NormalMixture('like',w=w,mu=mu,sigma=sigma,observed=np.array(d1))
    trace = pm.sample()
print(pm.summary(trace))
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma, mu, p]
Sampling 4 chains: 100%|██████████| 4000/4000 [00:10<00:00, 395.57draws/s]
The acceptance probability does not match the target. It is 0.883057127209148, but should be close to 0.8. Try to increase the number of tuning steps.
The gelman-rubin statistic is larger than 1.4 for some parameters. The sampler did not converge.
The estimated number of effective samples is smaller than 200 for some parameters.
              mean        sd  mc_error  ...  hpd_97.5     n_eff        Rhat
mu__0     1.244021  2.165433  0.216540  ...  5.005507  2.002049  212.596596
mu__1     3.743879  2.165122  0.216510  ...  5.012067  2.002040  235.750129
p__0      0.643069  0.248630  0.024846  ...  0.803369  2.004185   30.966189
p__1      0.356931  0.248630  0.024846  ...  0.798632  2.004185   30.966189
sigma__0  0.416207  0.125435  0.012517  ...  0.504110  2.009031   17.333177
sigma__1  0.271763  0.125539  0.012533  ...  0.497208  2.007779   19.217223
[6 rows x 7 columns]
Based on that, you will most likely need to reparametrize the model if you also want to estimate the two standard deviations from this data.
This answer supplements @balleveryday's answer, which suggests the Gaussian mixture model but had some trouble getting the symmetry breaking to work. Admittedly, the symmetry breaking in the official example is done in the context of Metropolis-Hastings sampling, whereas I think NUTS might be a little more sensitive to encountering impossible values (not sure). Here's what worked for me:
import numpy as np
import pymc3 as pm
from scipy.stats import bernoulli
import theano.tensor as tt
# everything should reproduce
np.random.seed(123)
n_sample = 2000
# Generate data for coin flip P(C) and store in c1
theta_real = 0.2 # unknown value in a real experiment
c1 = bernoulli.rvs(p=theta_real, size=n_sample)
# Generate data for normal distribution P(D|C) and store in d1
mu1, mu2 = 0, 5
sigma1, sigma2 = 0.5, 0.2
d1 = np.empty_like(c1, dtype=np.float64)
d1[c1 == 0] = np.random.normal(mu1, sigma1, np.sum(c1 == 0))
d1[c1 == 1] = np.random.normal(mu2, sigma2, np.sum(c1 == 1))
with pm.Model() as gmm_asym:
    # mixture vector
    w = pm.Dirichlet('p', a=np.ones(2))
    # Gaussian parameters (testval helps start off ordered)
    mu = pm.Normal('mu', 0, 20, shape=2, testval=[-10, 10])
    sigma = pm.HalfNormal('sigma', sd=2, shape=2)
    # break symmetry, forcing mu[0] < mu[1]
    order_means_potential = pm.Potential('order_means_potential',
                                         tt.switch(mu[1] - mu[0] < 0, -np.inf, 0))
    # observed
    pm.NormalMixture('like', w=w, mu=mu, sigma=sigma, observed=d1)
    # reproducible sampling
    tr_gmm_asym = pm.sample(tune=2000, target_accept=0.9, random_seed=20191121)
This produces samples with the statistics
              mean        sd  mc_error    hpd_2.5  hpd_97.5        n_eff      Rhat
mu__0     0.004549  0.011975  0.000226  -0.017398  0.029375  2425.487301  0.999916
mu__1     5.007663  0.008993  0.000166   4.989247  5.024692  2181.134002  0.999563
p__0      0.789983  0.009091  0.000188   0.773059  0.808062  2417.356539  0.999788
p__1      0.210017  0.009091  0.000188   0.191938  0.226941  2417.356539  0.999788
sigma__0  0.497322  0.009103  0.000186   0.480394  0.515867  2227.397854  0.999358
sigma__1  0.191310  0.006633  0.000141   0.178924  0.204859  2286.817037  0.999614
and the traces
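For readers wondering what symmetry is being broken: a mixture likelihood does not change when the component labels are swapped, so without an ordering constraint different chains can settle on different labelings. A minimal sketch in plain Python (the helper names and the five data points are hypothetical, chosen only to illustrate the invariance):

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def mixture_loglik(data, weights, mus, sigmas):
    """Log-likelihood of the data under a Gaussian mixture."""
    total = 0.0
    for x in data:
        total += math.log(sum(w * normal_pdf(x, m, s)
                              for w, m, s in zip(weights, mus, sigmas)))
    return total


data = [-0.3, 0.1, 0.4, 4.9, 5.1]
# Same mixture, written with the two component labels swapped.
ll_a = mixture_loglik(data, [0.8, 0.2], [0.0, 5.0], [0.5, 0.2])
ll_b = mixture_loglik(data, [0.2, 0.8], [5.0, 0.0], [0.2, 0.5])
print(math.isclose(ll_a, ll_b))  # True
```

Both labelings are equally good as far as the likelihood is concerned, which is why the potential forcing mu[0] < mu[1] is needed to pick one of them.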

Hierarchical Bayesian Linear Regression using PyMC3 is super slow

I am trying to implement a hierarchical Bayesian model (HBM) for logistic regression using the Adult dataset from the UCI repository.
I have already written the code, but sampling is super slow, on the order of 107s per sample, even for 64 dimensions (features). Am I doing something wrong?
I am attaching the code for reference. Following suggestions, I also rescaled the data to try to speed it up, to no avail.
I appreciate any feedback.
The code is a mixture of what has been written here and here.
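import numpy as np
import pandas as pd
import pymc3 as pm
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn_pandas import DataFrameMapper

# Reload the dataset, this time keeping the country as a label (not one-hot) for hierarchical modeling
adult_df = pd.read_csv('adult.data', header=None, sep=', ', engine='python')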
adult_df.columns = ["Age", "WorkClass", "fnlwgt", "Education", "EducationNum",
"MaritalStatus", "Occupation", "Relationship", "Race", "Gender",
"CapitalGain", "CapitalLoss", "HoursPerWeek", "NativeCountry", "Income"]
adult_df["Income"] = adult_df["Income"].map({ "<=50K": 0, ">50K": 1 })
adult_df.drop("CapitalGain", axis=1, inplace=True,)
adult_df.drop("CapitalLoss", axis=1, inplace=True,)
adult_df.Age = adult_df.Age.astype(float)
adult_df.fnlwgt = adult_df.fnlwgt.astype(float)
adult_df.EducationNum = adult_df.EducationNum.astype(float)
adult_df.HoursPerWeek = adult_df.HoursPerWeek.astype(float)
# dropping native country here!!
adult_df = pd.get_dummies(adult_df, columns=[
"WorkClass", "Education", "MaritalStatus", "Occupation", "Relationship",
"Race", "Gender",
])
standard_scaler_cols = ["Age", "fnlwgt", "EducationNum", "HoursPerWeek",]
other_cols = list(set(adult_df.columns) - set(standard_scaler_cols))
mapper = DataFrameMapper(
[([col,], StandardScaler(),) for col in standard_scaler_cols] +
[(col, None,) for col in other_cols]
)
le = preprocessing.LabelEncoder()
country_idx = le.fit_transform(adult_df['NativeCountry'])
y_all = adult_df["Income"].values
pd.value_counts(pd.Series(y_all))  # class balance; y_all has to be defined first
adult_df.drop("Income", axis=1, inplace=True,)
adult_df.drop("NativeCountry", axis=1, inplace=True,)
n_countries = len(set(country_idx))
n_features = len(adult_df.columns)
min_max_scaler = preprocessing.MinMaxScaler()
adult_df = min_max_scaler.fit_transform(adult_df)
# rs is the random seed; it is defined elsewhere and not shown in this snippet
X_train, X_test, y_train, y_test, country_idx_train, country_idx_test = train_test_split(adult_df, y_all, country_idx, train_size=0.1, test_size=0.25, stratify=y_all, random_state=rs)
with pm.Model() as multilevel_model:
    # Hyperpriors for slopes
    mu_theta = pm.MvNormal(name='mu_a', mu=np.zeros(n_features), cov=np.eye(n_features), shape=n_features)
    packed_L_theta = pm.LKJCholeskyCov('packed_L', n=n_features,
                                       eta=2., sd_dist=pm.HalfCauchy.dist(2.5))
    L_theta = pm.expand_packed_triangular(n_features, packed_L_theta)
    theta = pm.MvNormal(mu=mu_theta, name='mu_theta', chol=L_theta, shape=[n_countries, n_features])
    # Hyperpriors for intercept (Comment 1)
    mu_b = pm.StudentT('mu_b', nu=3, mu=0., sd=1.0)
    sigma_b = pm.HalfNormal('sigma_b', sd=1.0)
    b = pm.Normal('b', mu=mu_b, sd=sigma_b, shape=[n_countries, 1])
    # Calculate predictions given values
    # for intercept and slope
    yhat = pm.invlogit(b[country_idx_train] + pm.math.dot(theta[country_idx_train], np.asarray(X_train).T))
    # Make predictions fit reality
    y = pm.Binomial('y', n=np.ones(y_train.shape[0]), p=yhat, observed=y_train)
You will probably have more success with PyMC3 questions on our Discourse: https://discourse.pymc.io/ I invite you to move your question there.
The first thing I would check is whether your Theano is compiling against the MKL libraries, or whether it is perhaps even running in Python mode. If you installed things via conda, that should give you MKL; if you're using pip, it might be more difficult. http://deeplearning.net/software/theano/troubleshooting.html#test-blas
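Independently of the BLAS check, it may also be worth verifying the shape of the linear predictor. This is only a guess from reading the posted code, not something confirmed in the thread: multiplying an (observations x features) matrix by the transpose of another (observations x features) matrix yields an (observations x observations) matrix, whereas a per-observation predictor needs one value per row. A minimal sketch in plain Python with hypothetical toy sizes and helper names:

```python
import math


def transpose(m):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def rowwise_dot(theta_rows, x_rows):
    """One dot product per observation: sum_j theta[i][j] * x[i][j]."""
    return [sum(t * x for t, x in zip(tr, xr)) for tr, xr in zip(theta_rows, x_rows)]


n_obs, n_features = 4, 3  # toy sizes, not those of the Adult data
theta_rows = [[0.1 * (i + j) for j in range(n_features)] for i in range(n_obs)]
x_rows = [[float(i * j + 1) for j in range(n_features)] for i in range(n_obs)]

full = matmul(theta_rows, transpose(x_rows))  # n_obs x n_obs
per_obs = rowwise_dot(theta_rows, x_rows)     # length n_obs
print(len(full), len(full[0]), len(per_obs))
# Only the diagonal of the full product holds the per-observation values.
print(all(math.isclose(full[i][i], per_obs[i]) for i in range(n_obs)))
```

In array terms, the row-wise version corresponds to an element-wise product summed over the feature axis, for example something like (theta[country_idx_train] * X_train).sum(axis=1). Whether this is the actual cause of the slowdown is an assumption that would need to be tested.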

Reflecting boundary conditions in FiPy

I am attempting to solve the convection-diffusion equation in FiPy. For the moment, all I am trying to achieve is a Neumann boundary condition, so that the wave reflects back at the right-hand boundary rather than travelling out of the domain.
I have added the following line:
phi.faceGrad.constrain(0, mesh.exteriorFaces)
But this doesn't seem to change anything.
Am I imposing the wrong boundary condition? Am I imposing it incorrectly? I have searched for this, but can't seem to find an example which has the simple property of a wave reflecting off a boundary! My code is below. Thanks so much.
from fipy import *
nx = 100
L = 1.
dx = L/nx
steps = 160
dt = 0.1
t = dt * steps
mesh = Grid1D(nx=nx, dx=dx)
x = mesh.cellCenters[0]
phi = CellVariable(name="solution variable", mesh=mesh, value=0.)
phi.setValue(1., where=(x>0.03) & (x<0.09))
# Diffusion and convection coefficients
D = FaceVariable(name='diffusion coefficient',mesh=mesh,value=1.*10**(-4.))
C = (0.1,)
# Boundary conditions
phi.faceGrad.constrain(0, mesh.exteriorFaces)
eq = TransientTerm() == DiffusionTerm(coeff=D) - ConvectionTerm(coeff=C)
for step in range(steps):
    eq.solve(var=phi, dt=dt)
    if step%20==0:
        viewer = Viewer(vars=phi, datamin=0., datamax=1.)
        viewer.plot()