How to set "budget" tag for xgboost hyperband optimization with mlr3tuningspaces?

I am trying to tune xgboost with hyperband and I would like to use the suggested default tuning space from the mlr3tuningspaces package. However, I can't find how to tag a hyperparameter with "budget" while using lts().
Below, I reproduced the mlr3hyperband package example to illustrate my issue:
library(mlr3verse)
library(mlr3hyperband)
library(mlr3tuningspaces)
## this does not work, because I don't know how to tag a hyperparameter
## with "budget" while using the suggested tuning space
search_space = lts("classif.xgboost.default")
search_space$values
## this works because it has a hyperparameter (nrounds) tagged with "budget"
search_space = ps(
  nrounds = p_int(lower = 1, upper = 16, tags = "budget"),
  eta = p_dbl(lower = 0, upper = 1),
  booster = p_fct(levels = c("gbtree", "gblinear", "dart"))
)
# hyperparameter tuning on the pima indians diabetes data set
instance = tune(
  method = "hyperband",
  task = tsk("pima"),
  learner = lrn("classif.xgboost", eval_metric = "logloss"),
  resampling = rsmp("cv", folds = 3),
  measures = msr("classif.ce"),
  search_space = search_space,
  term_evals = 100
)
# best performing hyperparameter configuration
instance$result

Thanks for pointing this out. I will add the budget tag to the default search space. Until then, you can use this code:
library(mlr3hyperband)
library(mlr3tuningspaces)
library(mlr3learners)
# get learner with search space in one go
learner = lts(lrn("classif.xgboost"))
# overwrite nrounds with budget tag
learner$param_set$values$nrounds = to_tune(p_int(1000, 5000, tags = "budget"))
instance = tune(
  method = "hyperband",
  task = tsk("pima"),
  learner = learner,
  resampling = rsmp("cv", folds = 3),
  measures = msr("classif.ce"),
  term_evals = 100
)
Update 28.06.2022
The new API in version 0.3.0 is:
learner = lts(lrn("classif.xgboost"), nrounds = to_tune(p_int(1000, 5000, tags = "budget")))
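For completeness, a minimal sketch of how that one-liner slots into the tuning call from above (assuming mlr3tuningspaces >= 0.3.0 and the packages loaded earlier):
learner = lts(lrn("classif.xgboost", eval_metric = "logloss"),
              nrounds = to_tune(p_int(1000, 5000, tags = "budget")))
instance = tune(
  method = "hyperband",
  task = tsk("pima"),
  learner = learner,
  resampling = rsmp("cv", folds = 3),
  measures = msr("classif.ce"),
  term_evals = 100
)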

Related

Bayesian IRT PyMC3 - Parameter inference

I would like to estimate an IRT model using PyMC3.
I generated data with the following distribution:
import numpy as np
import theano.tensor as tt

alpha_fix = 4
beta_fix = 100
theta = np.random.normal(100, 15, 1000)
prob = np.exp(alpha_fix*(theta - beta_fix))/(1 + np.exp(alpha_fix*(theta - beta_fix)))
prob_tt = tt._shared(prob)
Then I created a model using PyMC3 to infer the parameters:
import pymc3 as pm

irt = pm.Model()
with irt:
    # Priors
    alpha = pm.Normal('alpha', mu=4, tau=1)
    beta = pm.Normal('beta', mu=100, tau=15)
    thau = pm.Normal('thau', mu=100, tau=15)
    # Modelling
    p = pm.Deterministic('p', tt.exp(alpha*(thau - beta))/(1 + tt.exp(alpha*(thau - beta))))
    out = pm.Normal('o', p, observed=prob_tt)
Then I run inference on the model:
with irt:
    mean_field = pm.fit(10000, method='advi', callbacks=[pm.callbacks.CheckParametersConvergence(diff='absolute')])
Finally, I sample from the model to compute the posterior:
pm.plot_posterior(mean_field.sample(1000), color='LightSeaGreen');
But the estimate for "alpha" (mean of 2.2) is relatively far from the expected value (4), even though the prior on alpha was well calibrated.
Would you have an idea of the origin of this gap and how to fix it?
Thanks a lot,
out = pm.Normal('o', p, observed=prob_tt)
Why are you using a Normal instead of a Bernoulli? Also, what is the variance of the Normal?
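To make that comment concrete, here is a minimal sketch of the Bernoulli alternative it hints at; the simulated 0/1 responses and the sigmoid parameterization are assumptions of mine, not code from the thread:
import numpy as np
import pymc3 as pm

# assumed setup: simulate binary responses instead of feeding raw probabilities
alpha_fix, beta_fix = 4, 100
theta = np.random.normal(100, 15, 1000)
prob = 1/(1 + np.exp(-alpha_fix*(theta - beta_fix)))
responses = np.random.binomial(1, prob)  # observed 0/1 data

with pm.Model() as irt:
    alpha = pm.Normal('alpha', mu=4, tau=1)
    beta = pm.Normal('beta', mu=100, tau=15)
    thau = pm.Normal('thau', mu=100, tau=15)
    p = pm.Deterministic('p', pm.math.sigmoid(alpha*(thau - beta)))
    # Bernoulli likelihood over 0/1 responses, as the comment suggests
    out = pm.Bernoulli('o', p=p, observed=responses)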

Is non-identical not enough to be considered 'distinct' for kmeans centroids?

I have an issue with k-means clustering when providing centroids. I saw the same problem already asked (K-means: Initial centers are not distinct), but the solution in that post does not work in my case.
I selected the centroids using ClusterR::KMeans_arma. I confirmed that my centroids are not identical using mgcv::uniquecombs, but I still get the "initial centers are not distinct" error.
> dim(t(dat))
[1] 13540 11553
> centroids = ClusterR::KMeans_arma(data = t(dat), centers = 561,
                                    n_iter = 50, seed_mode = "random_subset",
                                    verbose = FALSE, CENTROIDS = NULL)
> dim(centroids)
[1] 561 11553
> x = mgcv::uniquecombs(centroids)
> dim(x)
[1] 561 11553
> res = kmeans(t(dat), centers = centroids, iter.max = 200)
Error in kmeans(t(dat), centers = centroids, iter.max = 200) :
initial centers are not distinct
Any suggestions to resolve this? Thanks!
I replicated the issue you've mentioned with the following data:
cols = 13540
rows = 11553
set.seed(1)
vec_dat = runif(rows * cols)
dat = matrix(vec_dat, nrow = rows, ncol = cols)
dim(dat)
dat = t(dat)
dim(dat)
There is no 'centers' parameter in the 'ClusterR::KMeans_arma()' function, so I've assumed you actually meant 'clusters':
centroids = ClusterR::KMeans_arma(data = dat,
                                  clusters = 561,
                                  n_iter = 50,
                                  seed_mode = "random_subset",
                                  verbose = TRUE,
                                  CENTROIDS = NULL)
str(centroids)
dim(centroids)
The 'centroids' object is a matrix of class "k-means clustering". If your intention is to obtain the cluster assignments, you can use:
clust = ClusterR::predict_KMeans(data = dat,
                                 CENTROIDS = centroids,
                                 threads = 6)
length(unique(clust)) # 561
class(centroids) # "k-means clustering"
If you want to pass the 'centroids' to the base R 'kmeans' function, you have to set the 'class' of the 'centroids' object to NULL. That is because the base R 'kmeans' function internally uses the base R 'duplicated()' function (you can see this by typing print(kmeans) in the R console), which does not recognize the 'centroids' object as a matrix or data.frame (it is an object of class "k-means clustering") and performs the check element-wise rather than row-wise. Therefore, the following should work in your case:
class(centroids) = NULL
dups = duplicated(centroids)
sum(dups) # this should actually give 0
res = kmeans(dat, centers = centroids, iter.max = 200)
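As a small illustration of that class issue (my own sketch, not part of the original answer), compare how duplicated() dispatches on a plain matrix versus the same data carrying a foreign class attribute:
m = matrix(runif(20), nrow = 4)
length(duplicated(m))   # 4: duplicated.matrix compares row-wise

class(m) = "k-means clustering"
length(duplicated(m))   # 20: no method for this class, so the default
                        # element-wise check on the underlying vector runs

class(m) = NULL         # restore, so kmeans() sees a plain matrix again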
I've made a few adjustments to "ClusterR::predict_KMeans()"; in particular, I've added the "threads" parameter and a check for duplicates. Therefore, if you want to obtain the clusters using multiple cores, you have to install the package from GitHub using:
remotes::install_github('mlampros/ClusterR',
                        upgrade = 'always',
                        dependencies = TRUE,
                        repos = 'https://cloud.r-project.org/')
The changes will take effect in the next version of the CRAN package, which will be "1.2.2".
UPDATE regarding output and performance (based on your comment):
data(dietary_survey_IBS, package = 'ClusterR')
kmeans_arma = function(data) {
  km_cl = ClusterR::KMeans_arma(data,
                                clusters = 2,
                                n_iter = 10,
                                seed_mode = "random_subset",
                                seed = 1)
  pred_cl = ClusterR::predict_KMeans(data = data,
                                     CENTROIDS = km_cl,
                                     threads = 1)
  return(pred_cl)
}
km_arma = kmeans_arma(data = dietary_survey_IBS)
km_algos = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen")
for (algo in km_algos) {
  cat('base-kmeans-algo:', algo, '\n')
  km_base = kmeans(dietary_survey_IBS,
                   centers = 2,
                   iter.max = 10,
                   nstart = 1,       # can be set to 5 or 10 etc.
                   algorithm = algo)
  km_cl = as.vector(km_base$cluster)
  print(table(km_arma, km_cl))
  cat('--------------------------\n')
}
# 'algo' here holds the last value from the loop above
microbenchmark::microbenchmark(kmeans(dietary_survey_IBS,
                                      centers = 2,
                                      iter.max = 10,
                                      nstart = 1,       # can be set to 5 or 10 etc.
                                      algorithm = algo),
                               kmeans_arma(data = dietary_survey_IBS),
                               times = 100)
I don't see any significant difference in the output clusters between the base R 'kmeans' and the 'kmeans_arma' function for any of the available base R 'kmeans' algorithms (you can test this on your own data sets as well). I am not sure which algorithm the 'armadillo' library uses internally, and the base R 'kmeans' additionally includes the 'nstart' parameter (you can consult the documentation for more info). Regarding performance, you won't see any substantial differences for small to medium data sets, but because the armadillo library uses OpenMP internally, if your computer has more than one core I expect 'ClusterR::KMeans_arma' to return the 'centroids' faster on big data sets.

bnlearn error in structural.em

I get an error when trying to use structural.em from the "bnlearn" package.
This is the code:
cut.learn <- structural.em(cut.df, maximize = "hc",
                           maximize.args = "restart",
                           fit = "mle", fit.args = list(),
                           impute = "parents", impute.args = list(), return.all = FALSE,
                           max.iter = 5, debug = FALSE)
Error in check.data(x, allow.levels = TRUE, allow.missing = TRUE, warn.if.no.missing = TRUE, :
  at least one variable has no observed values.
Has anyone had the same problem? Please tell me how to fix it.
Thank you.
I got structural.em working. I am currently working on a Python interface to bnlearn that I call pybnl, and I also ran into the problem you describe above.
Here is a Jupyter notebook that shows how to use structural EM from Python on the marks data set.
The gist of it is described in slides-bnshort.pdf on page 135, "The MARKS Example, Revisited".
You have to create an initial fit with an initial imputed data frame by hand and then provide the arguments to structural.em like so (ldmarks is the latent-discrete-marks data frame, where the LAT column contains only missing/NA values):
library(bnlearn)
data('marks')
dmarks = discretize(marks, breaks = 2, method = "interval")
ldmarks = data.frame(dmarks, LAT = factor(rep(NA, nrow(dmarks)), levels = c("A", "B")))
imputed = ldmarks
# Randomly set values of the unobserved variable in the imputed data.frame
imputed$LAT = sample(factor(c("A", "B")), nrow(dmarks), replace = TRUE)
# Fit the parameters over an empty graph
dag = empty.graph(nodes = names(ldmarks))
fitted = bn.fit(dag, imputed)
# Although we've set imputed values randomly, nonetheless override them with a uniform distribution
fitted$LAT = array(c(0.5, 0.5), dim = 2, dimnames = list(c("A", "B")))
# Use a whitelist to enforce arcs from the latent node to all others
r = structural.em(ldmarks, fit = "bayes", impute = "bayes-lw", start = fitted,
                  maximize.args = list(whitelist = data.frame(from = "LAT", to = names(dmarks))),
                  return.all = TRUE)
You have to use bnlearn 4.4-20180620 or later, because it fixes a bug in the underlying impute function.
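Since return.all = TRUE is set above, r should be a list holding the learned structure, the imputed data and the fitted parameters; a short usage sketch based on bnlearn's documented return value:
r$dag       # the learned network structure
r$imputed   # the data set with the missing LAT values filled in
r$fitted    # the parameters of the final network
graphviz.plot(r$dag)  # optional visualization; requires the Rgraphviz package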

If I don't want to train in batches and my state is a vector, what should my tensors have for a shape?

I'm trying to use TensorFlow to solve a reinforcement learning problem. I created a gym environment of my own. The state is a one-dimensional array (size 224) and there are 170 actions to choose from (0...169). I do not want to train in batches. What I want is to get the most simple version of the RL problem running with TensorFlow.
My main problem is, I guess, the dimensions. I would assume that TF would allow me to input the state as a 1D tensor, but then I get an error when I want to calculate W*input=action. Dimension errors make it hard to know what's right. Also, examples on the web focus on training from images, in batches.
In general, I started from this tutorial, but the state is encoded differently there, which again makes it hard to follow (especially since I'm not really familiar with Python).
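As a side note on the shape convention (my own sketch, not from the tutorial): a placeholder of shape [1, n_state] expects a row vector, so a flat gym state of shape (224,) needs an explicit batch dimension of 1 before it is fed:
import numpy as np

s = np.zeros(224)          # a flat state, as env.reset() returns it
s_row = s.reshape(1, -1)   # shape (1, 224), matching a [1, n_state] placeholder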
import gym
import numpy as np
import random
import tensorflow as tf
env = gym.make('MyOwnEnv-v0')
n_state = 224
n_action = 170
sess = tf.InteractiveSession()
# Implementing the network itself
inputs1 = tf.placeholder(shape=[1,n_state],dtype=tf.float32)
W = tf.Variable(tf.random_uniform([n_state,n_action],0,0.01))
Qout = tf.transpose(tf.matmul(inputs1,W))
predict = tf.reshape(tf.argmax(Qout,1), [n_action,1])
#Below we obtain the loss by taking the sum of squares difference between the target and prediction Q values.
nextQ = tf.placeholder(shape=[n_action,1],dtype=tf.float32)
loss = tf.reduce_sum(tf.square(nextQ - Qout))
trainer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
updateModel = trainer.minimize(loss)
# Training the network
init = tf.global_variables_initializer()
print("input: ", inputs1.get_shape()
, "\nW: ", W.get_shape()
, "\nQout: ", Qout.get_shape()
, "\npredict:", predict.get_shape()
, "\nnextQ: ", nextQ.get_shape()
, "\nloss: ", loss.get_shape())
# Set learning parameters
y = .99
e = 0.1
num_episodes = 2000
#create lists to contain total rewards and steps per episode
jList = []
rList = []
with tf.Session() as sess:
    sess.run(init)
    for i in range(num_episodes):
        # Reset environment and get first new observation
        s = env.reset()
        rAll = 0
        d = False
        j = 0
        # The Q-Network
        while j < 99:
            j += 1
            # Choose an action greedily (with e chance of random action) from the Q-network
            a, allQ = sess.run([predict, Qout], feed_dict={inputs1: s})
            if np.random.rand(1) < e:
                a = env.action_space.sample()
            # Get new state and reward from environment
            s1, r, d, _ = env.step(a)
            # Obtain the Q' values by feeding the new state through our network
            Q1 = sess.run(Qout, feed_dict={inputs1: s1})
            # Obtain maxQ' and set our target value for the chosen action
            maxQ1 = np.max(Q1)
            targetQ = allQ
            #targetQ[0,a[0]] = r + y*maxQ1
            targetQ[a, 0] = r + y*maxQ1
            # Train our network using target and predicted Q values
            _, W1 = sess.run([updateModel, W], feed_dict={inputs1: s, nextQ: targetQ})
            rAll += r
            s = s1
            if d == True:
                # Reduce chance of random action as we train the model
                e = 1./((i/50) + 10)
                break
        jList.append(j)
        rList.append(rAll)
print('Percent of successful episodes: ' + str(sum(rList)/num_episodes) + '%')

Perform Kriging Interpolation using Arcpy

I have a list of point feature classes. I am trying to write a Python script to perform Kriging interpolation. I am getting an error message in this code: "Point_Num" is not defined.
Below is the script I am working on:
import arcpy
from arcpy import env
from arcpy.sa import *
arcpy.env.overwriteOutput = True
# Check out the ArcGIS Spatial Analyst extension license
arcpy.CheckOutExtension("Spatial")
In_Point = r"D:\NPP_Test\MERRA_TEMP_2012C" #(Point feature name:r001_mean, r002_mean.....r012_mean )
Out_Raster = r"D:\NPP_Test\MERRA_TEMP_2012D"
points = arcpy.ListFeatureClasses()
zFields = "GRID_CODE"
#Kriging variables
cellSize = 0.05
lagSize = 0.5780481172534
majorRange = 6
partialSill = 3.304292110
nugget = 0.002701348
kRadius = RadiusFixed(20000, 1)
#Mask region of interest
mask = r"D:\Gujarta Shape file\GUJARATSTATE.shp"
for zField in zFields:
    Point = Point_Num[:3]
    kModelUniversalObj = KrigingModelUniversal("LINEARDRIFT", lagSize, majorRange, partialSill, nugget)
    OutKriging = Kriging(inPointFeatures, zField, kModelUniversalObj, cellSize, kRadius)
    #IDWMASk = ExtractByMask(outIDW, mask)
    KrigMask = ExtractByMask(OutKriging, mask)
    #Save outraster as the same name of input
    KrigMask.save("r{}.tif".format(Point_Num))
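For what it's worth, a hedged sketch of one way to make the undefined name concrete. It assumes the intent is to loop over the point feature classes themselves (r001_mean, r002_mean, ...) and to derive each output name from the feature class name; that naming scheme is my guess, not something stated in the question:
arcpy.env.workspace = In_Point  # so ListFeatureClasses() finds the point feature classes
for point in arcpy.ListFeatureClasses():
    Point_Num = point[:4]  # e.g. "r001" from "r001_mean" (assumed naming scheme)
    kModelUniversalObj = KrigingModelUniversal("LINEARDRIFT", lagSize, majorRange, partialSill, nugget)
    OutKriging = Kriging(point, zFields, kModelUniversalObj, cellSize, kRadius)
    KrigMask = ExtractByMask(OutKriging, mask)
    KrigMask.save("{}/{}.tif".format(Out_Raster, Point_Num))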