Is non-identical not enough to be considered 'distinct' for kmeans centroids? - k-means

I have an issue with kmeans clustering providing centroids. I saw the same problem already asked (
K-means: Initial centers are not distinct), but the solution in that post is not working in my case.
I selected the centroids using ClusterR::Kmeans_arma. I confirmed that my centroids are not identical using mgcv::uniquecombs, but still got the initial centers are not distinct error.
> dim(t(dat))
[1] 13540 11553
> centroids = ClusterR::KMeans_arma(data = t(dat), centers = 561,
n_iter = 50, seed_mode = "random_subset",
verbose = FALSE, CENTROIDS = NULL)
> dim(centroids)
[1] 561 11553
> x = mgcv::uniquecombs(centroids)
> dim(x)
[1] 561 11553
> res = kmeans(t(dat), centers = centroids, iter.max = 200)
Error in kmeans(t(dat), centers = centroids, iter.max = 200) :
initial centers are not distinct
Any suggestion to resolve this? Thanks!

I replicated the issue you've mentioned with the following data:
cols = 13540
rows = 11553
set.seed(1)
vec_dat = runif(rows * cols)
dat = matrix(vec_dat, nrow = rows, ncol = cols)
dim(dat)
dat = t(dat)
dim(dat)
There is no 'centers' parameter in the 'ClusterR::KMeans_arma()' function, therefore I've assumed you actually mean 'clusters',
centroids = ClusterR::KMeans_arma(data = dat,
clusters = 561,
n_iter = 50,
seed_mode = "random_subset",
verbose = TRUE,
CENTROIDS = NULL)
str(centroids)
dim(centroids)
The 'centroids' is a matrix of class "k-means clustering". If your intention is to come to the clusters then you can use,
clust = ClusterR::predict_KMeans(data = dat,
CENTROIDS = centroids,
threads = 6)
length(unique(clust)) # 561
class(centroids) # "k-means clustering"
If you want to pass the 'centroids' to the base R 'kmeans' function you have to set the 'class' of the 'centroids' object to NULL and that because the base R 'kmeans' function uses internally the base R 'duplicated()' function (you can view this by using print(kmeans) in the R console) which does not recognize the 'centroids' object as a matrix or data.frame (it is an object of class "k-means clustering") and performs the checking column-wise rather than row-wise. Therefore, the following should work for your case,
class(centroids) = NULL
dups = duplicated(centroids)
sum(dups) # this should actually give 0
res = kmeans(dat, centers = centroids, iter.max = 200)
I've made a few adjustments to the "ClusterR::predict_KMeans()" and particularly I've added the "threads" parameter and a check for duplicates, therefore if you want to come to the clusters using multiple cores you have to install the package from Github using,
remotes::install_github('mlampros/ClusterR',
upgrade = 'always',
dependencies = TRUE,
repos = 'https://cloud.r-project.org/')
The changes will take effect in the next version of the CRAN package which will be "1.2.2"
UPDATE regarding output and performance (based on your comment):
data(dietary_survey_IBS, package = 'ClusterR')
kmeans_arma = function(data) {
km_cl = ClusterR::KMeans_arma(data,
clusters = 2,
n_iter = 10,
seed_mode = "random_subset",
seed = 1)
pred_cl = ClusterR::predict_KMeans(data = data,
CENTROIDS = km_cl,
threads = 1)
return(pred_cl)
}
km_arma = kmeans_arma(data = dietary_survey_IBS)
km_algos = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen")
for (algo in km_algos) {
cat('base-kmeans-algo:', algo, '\n')
km_base = kmeans(dietary_survey_IBS,
centers = 2,
iter.max = 10,
nstart = 1, # can be set to 5 or 10 etc.
algorithm = algo)
km_cl = as.vector(km_base$cluster)
print(table(km_arma, km_cl))
cat('--------------------------\n')
}
microbenchmark::microbenchmark(kmeans(dietary_survey_IBS,
centers = 2,
iter.max = 10,
nstart = 1, # can be set to 5 or 10 etc.
algorithm = algo), kmeans_arma(data = dietary_survey_IBS), times = 100)
I don't see any significant difference in the output clusters between the 'base R kmeans' and the 'kmeans_arma' function for all available 'base R kmeans' algorithms (you can test it also for your own data sets). I am not sure which algorithm the 'armadillo' library uses internally and moreover the 'base R kmeans' includes the 'nstart' parameter (you can consult the documentation for more info). Regarding performance you won't see any substantial differences for small to medium data sets but due to the fact that the armadillo library uses OpenMP internally in case that your computer has more than 1 cores then for big data sets I think the 'ClusterR::KMeans_arma' function will return the 'centroids' faster.

Related

Trouble writing OptimizationFunction for automatic forward differentiation during Parameter Estimation of an ODEProblem

I am trying to learn Julia for its potential use in parameter estimation. I am interested in estimating kinetic parameters of chemical reactions, which usually involves optimizing reaction parameters with multiple independent batches of experiments. I have successfully optimized a single batch, but need to expand the problem to use many different batches. In developing a sample problem, I am trying to optimize using two toy batches. I know there are probably smarter ways to do this (subject of a future question), but my current workflow involves calling an ODEProblem for each batch, calculating its loss against the data, and minimizing the sum of the residuals for the two batches. Unfortunately, I get an error when initiating the optimization with Optimization.jl. The current code and error are shown below:
using DifferentialEquations, Plots, DiffEqParamEstim
using Optimization, ForwardDiff, OptimizationOptimJL, OptimizationNLopt
using Ipopt, OptimizationGCMAES, Optimisers
using Random
#Experimental data, species B is NOT observed in the data
times = [0.0, 0.071875, 0.143750, 0.215625, 0.287500, 0.359375, 0.431250,
0.503125, 0.575000, 0.646875, 0.718750, 0.790625, 0.862500,
0.934375, 1.006250, 1.078125, 1.150000]
A_obs = [1.0, 0.552208, 0.300598, 0.196879, 0.101175, 0.065684, 0.045096,
0.028880, 0.018433, 0.011509, 0.006215, 0.004278, 0.002698,
0.001944, 0.001116, 0.000732, 0.000426]
C_obs = [0.0, 0.187768, 0.262406, 0.350412, 0.325110, 0.367181, 0.348264,
0.325085, 0.355673, 0.361805, 0.363117, 0.327266, 0.330211,
0.385798, 0.358132, 0.380497, 0.383051]
P_obs = [0.0, 0.117684, 0.175074, 0.236679, 0.234442, 0.270303, 0.272637,
0.274075, 0.278981, 0.297151, 0.297797, 0.298722, 0.326645,
0.303198, 0.277822, 0.284194, 0.301471]
#Create additional data sets for a multi data set optimization
#Simple noise added to data for testing
times_2 = times[2:end] .+ rand(range(-0.05,0.05,100))
P_obs_2 = P_obs[2:end] .+ rand(range(-0.05,0.05,100))
A_obs_2 = A_obs[2:end].+ rand(range(-0.05,0.05,100))
C_obs_2 = C_obs[2:end].+ rand(range(-0.05,0.05,100))
#ki = [2.78E+00, 1.00E-09, 1.97E-01, 3.04E+00, 2.15E+00, 5.27E-01] #Target optimized parameters
ki = [0.1, 0.1, 0.1, 0.1, 0.1, 0.1] #Initial guess of parameters
IC = [1.0, 0.0, 0.0, 0.0] #Initial condition for each species
tspan1 = (minimum(times),maximum(times)) #tuple timespan of data set 1
tspan2 = (minimum(times_2),maximum(times_2)) #tuple timespan of data set 2
# data = VectorOfArray([A_obs,C_obs,P_obs])'
data = vcat(A_obs',C_obs',P_obs') #Make multidimensional array containing all observed data for dataset1, transpose to match shape of ODEProblem output
data2 = vcat(A_obs_2',C_obs_2',P_obs_2') #Make multidimensional array containing all observed data for dataset2, transpose to match shape of ODEProblem output
#make dictionary containing data, time, and initial conditions
keys1 = ["A","B"]
keys2 = ["time","obs","IC"]
entryA =[times,data,IC]
entryB = [times_2, data2,IC]
nest=[Dict(zip(keys2,entryA)),Dict(zip(keys2,entryB))]
exp_dict = Dict(zip(keys1,nest)) #data dictionary
#rate equations in power law form r = k [A][B]
function rxn(x, k)
A = x[1]
B = x[2]
C = x[3]
P = x[4]
k1 = k[1]
k2 = k[2]
k3 = k[3]
k4 = k[4]
k5 = k[5]
k6 = k[6]
r1 = k1 * A
r2 = k2 * A * B
r3 = k3 * C * B
r4 = k4 * A
r5 = k5 * A
r6 = k6 * A * B
return [r1, r2, r3, r4, r5, r6] #returns reaction rate of each equation
end
#Mass balance differential equations
function mass_balances(di,x,args,t)
k = args
r = rxn(x, k)
di[1] = - r[1] - r[2] - r[4] - r[5] - r[6] #Species A
di[2] = + r[1] - r[2] - r[3] - r[6] #Species B
di[3] = + r[2] - r[3] + r[4] #Species C
di[4] = + r[3] + r[5] + r[6] #Species P
end
function ODESols(time,uo,parms)
time_init = (minimum(time),maximum(time))
prob = ODEProblem(mass_balances,uo,time_init,parms)
sol = solve(prob, Tsit5(), reltol=1e-8, abstol=1e-8,save_idxs = [1,3,4],saveat=time) #Integrate prob
return sol
end
function cost_function(data_dict,parms)
res_dict = Dict(zip(keys(data_dict),[0.0,0.0]))
for key in keys(data_dict)
pred = ODESols(data_dict[key]["time"],data_dict[key]["IC"],parms)
loss = L2Loss(data_dict[key]["time"],data_dict[key]["obs"])
err = loss(pred)
res_dict[key] = err
end
residual = sum(res_dict[key] for key in keys(res_dict))
#show typeof(residual)
return residual
end
lb = [0.0,0.0,0.0,0.0,0.0,0.0] #parameter lower bounds
ub = [10.0,10.0,10.0,10.0,10.0,10.0] #parameter upper bounds
optfun = Optimization.OptimizationFunction(cost_function,Optimization.AutoForwardDiff())
optprob = Optimization.OptimizationProblem(optfun,exp_dict, ki,lb=lb,ub=ub,reltol=1E-8) #Set up optimization problem
optsol=solve(optprob, BFGS(),maxiters=10000) #Solve optimization problem
println(optsol.u) #print solution
when I call optsol I get the error:
ERROR: MethodError: no method matching ForwardDiff.GradientConfig(::Optimization.var"#89#106"{OptimizationFunction{true, Optimization.AutoForwardDiff{nothing}, typeof(cost_function), Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED_NO_TIME), Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing}, Vector{Float64}}, ::Dict{String, Dict{String, Array{Float64}}}, ::ForwardDiff.Chunk{2})
Searching online suggests that the issue may be that my cost_function function is not generic enough for ForwardDiff to handle, however I am not sure how to identify where the issue is in this function, or whether it is related to the functions (mass_balances and rxn) that are called within cost_function. Another potential issue is that I am not calling the functions appropriately when building the OptimizationFunction or the OpptimizationProblem, but I cannot identify the issue here either.
Thank you for any suggestions and your help in troubleshooting this application!
res_dict = Dict(zip(keys(data_dict),[0.0,0.0]))
This dictionary is declared to the wrong type.
zerotype = zero(params[1])
res_dict = Dict(zip(keys(data_dict),[zerotype ,zerotype]))
or
res_dict = Dict(zip(keys(data_dict),zeros(eltype(params),2)))
Either way, you want your intermediate calculations to match the type of params when using AutoForwardDiff().
In addition to the variable type specification suggested by Chris, my model also had an issue with the order of the arguments of cost_function and how I passed the arguments to the problem in optprob. This solution was shown by Contradict here

How to set "budget" tag for xgboost hyperband optimization with mlr3tuningspaces?

I am trying to tune xgboost with hyperband and I would like to use the suggested default tuning space from the mlr3tuningspaces package. However, I don't find how to tag a hyperparameter with "budget" while using lts .
Below, I reproduced the mlr3hyperband package example to illustrate my issue:
library(mlr3verse)
library(mlr3hyperband)
library(mlr3tuningspaces)
## this does not work, because I don't know how to tag a hyperparameter
## with "budget" while using the suggested tuning space
search_space = lts("classif.xgboost.default")
search_space$values
## this works because it has a hyperparameter (nrounds) tagged with "bugdget"
search_space = ps(
nrounds = p_int(lower = 1, upper = 16, tags = "budget"),
eta = p_dbl(lower = 0, upper = 1),
booster = p_fct(levels = c("gbtree", "gblinear", "dart"))
)
# hyperparameter tuning on the pima indians diabetes data set
instance = tune(
method = "hyperband",
task = tsk("pima"),
learner = lrn("classif.xgboost", eval_metric = "logloss"),
resampling = rsmp("cv", folds = 3),
measures = msr("classif.ce"),
search_space = search_space,
term_evals = 100
)
# best performing hyperparameter configuration
instance$result
Thanks for pointing this out. I will add the budget tag to the default search space. Until then you can use this code.
library(mlr3hyperband)
library(mlr3tuningspaces)
library(mlr3learners)
# get learner with search space in one go
learner = lts(lrn("classif.xgboost"))
# overwrite nrounds with budget tag
learner$param_set$values$nrounds = to_tune(p_int(1000, 5000, tags = "budget"))
instance = tune(
method = "hyperband",
task = tsk("pima"),
learner = learner,
resampling = rsmp("cv", folds = 3),
measures = msr("classif.ce"),
term_evals = 100
)
Update 28.06.2022
The new API in version 0.3.0 is
learner = lts(lrn("classif.xgboost"), nrounds = to_tune(p_int(1000, 5000, tags = "budget"))

Bayesian IRT Pymc3 - Parameter inference

I would like to estimate IRT model using PyMC3.
I generated data with the following distribution:
alpha_fix = 4
beta_fix = 100
theta= np.random.normal(100,15,1000)
prob = np.exp(alpha_fix*(theta-beta_fix))/(1+np.exp(alpha_fix*(theta-beta_fix)))
prob_tt = tt._shared(prob)
Then I created a model using PyMC3 to infer the parameter:
irt = pm.Model()
with irt:
# Priors
alpha = pm.Normal('alpha',mu = 4 , tau = 1)
beta = pm.Normal('beta',mu = 100 , tau = 15)
thau = pm.Normal('thau' ,mu = 100 , tau = 15)
# Modelling
p = pm.Deterministic('p',tt.exp(alpha*(thau-beta))/(1+tt.exp(alpha*(thau-beta))))
out = pm.Normal('o',p,observed = prob_tt)
Then I infer through the model:
with irt:
mean_field = pm.fit(10000,method='advi', callbacks=[pm.callbacks.CheckParametersConvergence(diff='absolute')])
Finally, Sample from the model to get compute posterior:
pm.plot_posterior(mean_field.sample(1000), color='LightSeaGreen');
But the results of the "alpha" (mean of 2.2) is relatively far from the expected one (4) even though the prior on alpha was well-calibrated.
Would you have an idea of the origin of this gap and how to fix it?
Thanks a lot,
out = pm.Normal('o',p,observed = prob_tt)
Why you are using Normal instead of Bernoulli ? Also, what is the variance of normal ?

bnlearn error in structural.em

I got an error when try to use structural.em in "bnlearn" package
This is the code:
cut.learn<- structural.em(cut.df, maximize = "hc",
+ maximize.args = "restart",
+ fit="mle", fit.args = list(),
+ impute = "parents", impute.args = list(), return.all = FALSE,
+ max.iter = 5, debug = FALSE)
Error in check.data(x, allow.levels = TRUE, allow.missing = TRUE,
warn.if.no.missing = TRUE, : at least one variable has no observed
values.
Did anyone have the same problems, please tell me how to fix it.
Thank you.
I got structural.em working. I am currently working on a python interface to bnlearn that I call pybnl. I also ran into the problem you desecribe above.
Here is a jupyter notebook that shows how to use StructuralEM from python marks.
The gist of it is described in slides-bnshort.pdf on page 135, "The MARKS Example, Revisited".
You have to create an inital fit with an inital imputed dataframe by hand and then provide the arguments to structural.em like so (ldmarks is the latent-discrete-marks dataframe where the LAT column only contains missing/NA values):
library(bnlearn)
data('marks')
dmarks = discretize(marks, breaks = 2, method = "interval")
ldmarks = data.frame(dmarks, LAT = factor(rep(NA, nrow(dmarks)), levels = c("A", "B")))
imputed = ldmarks
# Randomly set values of the unobserved variable in the imputed data.frame
imputed$LAT = sample(factor(c("A", "B")), nrow(dmarks2), replace = TRUE)
# Fit the parameters over an empty graph
dag = empty.graph(nodes = names(ldmarks))
fitted = bn.fit(dag, imputed)
# Although we've set imputed values randomly, nonetheless override them with a uniform distribution
fitted$LAT = array(c(0.5, 0.5), dim = 2, dimnames = list(c("A", "B")))
# Use whitelist to enforce arcs from the latent node to all others
r = structural.em(ldmarks, fit = "bayes", impute="bayes-lw", start=fitted, maximize.args=list(whitelist = data.frame(from = "LAT", to = names(dmarks))), return.all = TRUE)
You have to use bnlearn 4.4-20180620 or later, because it fixes a bug in the underlying impute function.

Storing plot objects in a list

I asked this question yesterday about storing a plot within an object. I tried implementing the first approach (aware that I did not specify that I was using qplot() in my original question) and noticed that it did not work as expected.
library(ggplot2) # add ggplot2
string = "C:/example.pdf" # Setup pdf
pdf(string,height=6,width=9)
x_range <- range(1,50) # Specify Range
# Create a list to hold the plot objects.
pltList <- list()
pltList[]
for(i in 1 : 16){
# Organise data
y = (1:50) * i * 1000 # Get y col
x = (1:50) # get x col
y = log(y) # Use natural log
# Regression
lm.0 = lm(formula = y ~ x) # make linear model
inter = summary(lm.0)$coefficients[1,1] # Get intercept
slop = summary(lm.0)$coefficients[2,1] # Get slope
# Make plot name
pltName <- paste( 'a', i, sep = '' )
# make plot object
p <- qplot(
x, y,
xlab = "Radius [km]",
ylab = "Services [log]",
xlim = x_range,
main = paste("Sample",i)
) + geom_abline(intercept = inter, slope = slop, colour = "red", size = 1)
print(p)
pltList[[pltName]] = p
}
# close the PDF file
dev.off()
I have used sample numbers in this case so the code runs if it is just copied. I did spend a few hours puzzling over this but I cannot figure out what is going wrong. It writes the first set of pdfs without problem, so I have 16 pdfs with the correct plots.
Then when I use this piece of code:
string = "C:/test_tabloid.pdf"
pdf(string, height = 11, width = 17)
grid.newpage()
pushViewport( viewport( layout = grid.layout(3, 3) ) )
vplayout <- function(x, y){viewport(layout.pos.row = x, layout.pos.col = y)}
counter = 1
# Page 1
for (i in 1:3){
for (j in 1:3){
pltName <- paste( 'a', counter, sep = '' )
print( pltList[[pltName]], vp = vplayout(i,j) )
counter = counter + 1
}
}
dev.off()
the result I get is the last linear model line (abline) on every graph, but the data does not change. When I check my list of plots, it seems that all of them become overwritten by the most recent plot (with the exception of the abline object).
A less important secondary question was how to generate a muli-page pdf with several plots on each page, but the main goal of my code was to store the plots in a list that I could access at a later date.
Ok, so if your plot command is changed to
p <- qplot(data = data.frame(x = x, y = y),
x, y,
xlab = "Radius [km]",
ylab = "Services [log]",
xlim = x_range,
ylim = c(0,10),
main = paste("Sample",i)
) + geom_abline(intercept = inter, slope = slop, colour = "red", size = 1)
then everything works as expected. Here's what I suspect is happening (although Hadley could probably clarify things). When ggplot2 "saves" the data, what it actually does is save a data frame, and the names of the parameters. So for the command as I have given it, you get
> summary(pltList[["a1"]])
data: x, y [50x2]
mapping: x = x, y = y
scales: x, y
faceting: facet_grid(. ~ ., FALSE)
-----------------------------------
geom_point:
stat_identity:
position_identity: (width = NULL, height = NULL)
mapping: group = 1
geom_abline: colour = red, size = 1
stat_abline: intercept = 2.55595281266726, slope = 0.05543539319091
position_identity: (width = NULL, height = NULL)
However, if you don't specify a data parameter in qplot, all the variables get evaluated in the current scope, because there is no attached (read: saved) data frame.
data: [0x0]
mapping: x = x, y = y
scales: x, y
faceting: facet_grid(. ~ ., FALSE)
-----------------------------------
geom_point:
stat_identity:
position_identity: (width = NULL, height = NULL)
mapping: group = 1
geom_abline: colour = red, size = 1
stat_abline: intercept = 2.55595281266726, slope = 0.05543539319091
position_identity: (width = NULL, height = NULL)
So when the plot is generated the second time around, rather than using the original values, it uses the current values of x and y.
I think you should use the data argument in qplot, i.e., store your vectors in a data frame.
See Hadley's book, Section 4.4:
The restriction on the data is simple: it must be a data frame. This is restrictive, and unlike other graphics packages in R. Lattice functions can take an optional data frame or use vectors directly from the global environment. ...
The data is stored in the plot object as a copy, not a reference. This has two
important consequences: if your data changes, the plot will not; and ggplot2 objects are entirely self-contained so that they can be save()d to disk and later load()ed and plotted without needing anything else from that session.
There is a bug in your code concerning list subscripting. It should be
pltList[[pltName]]
not
pltList[pltName]
Note:
class(pltList[1])
[1] "list"
pltList[1] is a list containing the first element of pltList.
class(pltList[[1]])
[1] "ggplot"
pltList[[1]] is the first element of pltList.
For your second question: Multi-page pdfs are easy -- see help(pdf):
onefile: logical: if true (the default) allow multiple figures in one
file. If false, generate a file with name containing the
page number for each page. Defaults to ‘TRUE’.
For your main question, I don't understand if you want to store the plot inputs in a list for later processing, or the plot outputs. If it is the latter, I am not sure that plot() returns an object you can store and retrieve.
Another suggestion regarding your second question would be to use either Sweave or Brew as they will give you complete control over how you display your multi-page pdf.
Have a look at this related question.