what is the kappa variable (BayesianOptimization) - optimization

I read some posts and tutorials about BayesianOptimization and I never saw explanation about kappa variable.
What is the kappa variable ?
How can it help us ?
How this values can influence the BayesianOptimization process ?

The kappa parameter, along with xi, is used to control how much the Bayesian optimization acquisition function balances exploration and exploitation.
Higher kappa values mean more exploration and less exploitation and vice versa for low values. Exploration pushes the search towards unexplored regions and exploitation focuses on results in the vicinity of the current best results by penalizing for higher variance values.
It may be beneficial to begin with default kappa values at the start of optimization and then lower values if you reduce the search space.
In scikit-optimize, kappa is only used if the acquisition function acq_func is set to “LCB” and xi is used when acq_func is “EI” or “PI” where LCB is Lower Confidence Bound, EI is Expected Improvement and PI is Probability of Improvement.
Similarly for the BayesianOptimization package:
acq: {'ucb', 'ei', 'poi'}
The acquisition method used.
* 'ucb' stands for the Upper Confidence Bounds method
* 'ei' is the Expected Improvement method
* 'poi' is the Probability Of Improvement criterion.
Mathematical details on acquisition functions
Note, the BayesianOptimization package and scikit-optimize use different default kappa values: 2.576 and 1.96 respectively.
There is a decent exploration vs exploitation example in the scikit-optimize docs.
There is a similar BayesianOptimization exploration vs exploitation example notebook.
FWIW I've used both packages and gotten OK results. I find the scikit-optimize plotting functions to be useful when fine tuning the parameter search space.

Related

Isn't it dangerous to apply Min Max Scaling to the test set?

Here's the situation I am worrying about.
Let me say I have a model trained with min-max scaled data. I want to test my model, so I also scaled the test dataset with my old scaler which was used in the training stage. However, my new test data's turned out to be the newer minimum, so the scaler returned negative value.
As far as I know, minimum and maximum aren't that stable value, especially in the volatile dataset such as cryptocurrency data. In this case, should I update my scaler? Or should I retrain my model?
I happen to disagree with #Sharan_Sundar. The point of scaling is to bring all of your features onto a single scale, not to rigorously ensure that they lie in the interval [0,1]. This can be very important, especially when considering regularization techniques the penalize large coefficients (whether they be linear regression coefficients or neural network weights). The combination of feature scaling and regularization help to ensure your model generalizes to unobserved data.
Scaling based on your "test" data is not a great idea because in practice, as you pointed out, you can easily observe new data points that don't lie within the bounds of your original observations. Your model needs to be robust to this.
In general, I would recommend considering different scaling routines. scikitlearn's MinMaxScaler is one, as is StandardScaler (subtract mean and divide by standard deviation). In the case where your target variable, cryptocurrency price can vary over multiple orders of magnitude, it might be worth using the logarithm function for scaling some of your variables. This is where data science becomes an art -- there's not necessarily a 'right' answer here.
(EDIT) - Also see: Do you apply min max scaling separately on training and test data?
Ideally you should scale first and then only split into test and train. But its not preferable to use minmax scaler with data which can have dynamically varying min and max values with significant variance in realtime scenario.

Machine learning: why the cost function does not need to be derivable?

I was playing around with Tensorflow creating a customized loss function and this question about general machine learning arose to my head.
My understanding is that the optimization algorithm needs a derivable cost function to find/approach a minimum, however we can use functions that are non-derivable such as the absolute function (there is no derivative when x=0). A more extreme example, I defined my cost function like this:
def customLossFun(x,y):
return tf.sign(x)
and I expected an error when running the code, but it actually worked (it didn't learn anything but it didn't crash).
Am I missing something?
You're missing the fact that the gradient of the sign function is somewhere manually defined in the Tensorflow source code.
As you can see here:
def _SignGrad(op, _):
"""Returns 0."""
x = op.inputs[0]
return array_ops.zeros(array_ops.shape(x), dtype=x.dtype)
the gradient of tf.sign is defined to be always zero. This, of course, is the gradient where the derivate exists, hence everywhere but not in zero.
The tensorflow authors decided to do not check if the input is zero and throw an exception in that specific case
In order to prevent TensorFlow from throwing an error, the only real requirement is that you cost function evaluates to a number for any value of your input variables. From a purely "will it run" perspective, it doesn't know/care about the form of the function its trying to minimize.
In order for your cost function to provide you a meaningful result when TensorFlow uses it to train a model, it additionally needs to 1) get smaller as your model does better and 2) be bounded from below (i.e. it can't go to negative infinity). It's not generally necessary for it to be smooth (e.g. abs(x) has a kink where the sign flips). Tensorflow is always able to compute gradients at any location using automatic differentiation (https://en.wikipedia.org/wiki/Automatic_differentiation, https://www.tensorflow.org/versions/r0.12/api_docs/python/train/gradient_computation).
Of course, those gradients are of more use if you've chose a meaningful cost function isn't isn't too flat.
Ideally, the cost function needs to be smooth everywhere to apply gradient based optimization methods (SGD, Momentum, Adam, etc). But nothing's going to crash if it's not, you can just have issues with convergence to a local minimum.
When the function is non-differentiable at a certain point x, it's possible to get large oscillations if the neural network converges to this x. E.g., if the loss function is tf.abs(x), it's possible that the network weights are mostly positive, so the inference x > 0 at all times, so the network won't notice tf.abs. However, it's more likely that x will bounce around 0, so that the gradient is arbitrarily positive and negative. If the learning rate is not decaying, the optimization won't converge to the local minimum, but will bound around it.
In your particular case, the gradient is zero all the time, so nothing's going to change at all.
If it didn't learn anything, what have you gained ? Your loss function is differentiable almost everywhere but it is flat almost anywhere so the minimizer can't figure out the direction towards the minimum.
If you start out with a positive value, it will most likely be stuck at a random value on the positive side even though the minima on the left side are better (have a lower value).
Tensorflow can be used to do calculations in general and it provides a mechanism to automatically find the derivative of a given expression and can do so across different compute platforms (CPU, GPU) and distributed over multiple GPUs and servers if needed.
But what you implement in Tensorflow does not necessarily have to be a goal function to be minimized. You could use it e.g. to throw random numbers and perform Monte Carlo integration of a given function.

xgboost vs H2o Gradient Boosting

I have a dataset having a large missing values (more than 40% missing). Genrated a model in xgboost and H2o gradient boosting - got a decent model in both cases. However, the xgboost shows this variable as one of the key contributors to the model but as per H2o Gradient Boosting the variable is not important. Does xgboost handle variables with missing values differently. All the configuration to both the models are exactly the same.
Both missing value handling and variable importances are slightly different between the two methods. Both are treating missing values as information (i.e., they learn from them, and don't just impute with a simple constant). The variable importances are computed from the gains of their respective loss functions during tree construction. H2O uses squared error, and XGBoost uses a more complicated one based on gradient and hessian.
One thing you could check is the variance of the variable importances between different runs with different seeds, to see how stable each method is in terms of variable importances.
PS. If you have categoricals, then you're better off leaving the column as a factor for H2O, no need to do your own encoding. This can lead to a different effective count of columns vs XGBoost's dataset, so for column sampling, things will be different.

Standard Errors for Differential Evolution

Is it possible to calculate standard errors for Differential Evolution?
From the Wikipedia entry:
http://en.wikipedia.org/wiki/Differential_evolution
It's not derivative based (indeed that is one of its strengths) but how then so you calculate the standard errors?
I would have thought some kind of bootstrapping strategy might have been applicable but can't seem to find any sources than apply bootstrapping to DE?
Baz
Concerning the standard errors, differential evolution is just like any other evolutionary algorithm.
Using a bootstrapping strategy seems a good idea: the usual formulas assume a normal (Gaussian) distribution for the underlying data. That's almost never true for evolutionary computation (exponential distributions being far more common, probably followed by bimodal distributions).
The simplest bootstrap method involves taking the original data set of N numbers and sampling from it to form a new sample (a resample) that is also of size N. The resample is taken from the original using sampling with replacement. This process is repeated a large number of times (typically 1000 or 10000 times) and for each of these bootstrap samples we compute its mean / median (each of these are called bootstrap estimates).
The standard deviation (SD) of the means is the bootstrapped standard error (SE) of the mean and the SD of the medians is the bootstrapped SE of the median (the 2.5th and 97.5th centiles of the means are the bootstrapped 95% confidence limits for the mean).
Warnings:
the word population is used with different meanings in different contexts (bootstrapping vs evolutionary algorithm)
in any GA or GP, the average of the population tells you almost nothing of interest. Use the mean/median of the best-of-run
the average of a set that is not normally distributed produces a value that behaves non-intuitively. Especially if the probability distribution is skewed: large values in "tail" can dominate and average tends to reflect the typical value of the "worst" data not the typical value of the data in general. In this case it's better the median
Some interesting links are:
A short guide to using statistics in Evolutionary Computation
An Introduction to Statistics for EC Experimental Analysis

How to analyse 'noisiness' of an array of points

Have done fft (see earlier posting if you are interested!) and got a result, which helps me. Would like to analyse the noisiness / spikiness of an array (actually a vb.nre collection of single). Um, how to explain ...
When signal is good, fft power results is 512 data points (frequency buckets) with low values in all but maybe 2 or 3 array entries, and a decent range (i.e. the peak is high, relative to the noise value in the nearly empty buckets. So when graphed, we have a nice big spike in the values in those few buckets.
When signal is poor/noisy, data values spread (max to min) is low, and there's proportionally higher noise in many more buckets.
What's a good, computationally non-intensive was of analysing the noisiness of this data set? Would some kind of statistical method, standard deviations or something help ?
The key is defining what is noise and what is signal, for which modelling assumptions must be made. Often an assumption is made of white noise (constant power per frequency band) or noise of some other power spectrum, and that model is fitted to the data. The signal to noise ratio can then be used to measure the amount of noise.
Fitting a noise model depends on the nature of your data: if you know that the real signal will have no power in the high frequency components, you can look there for an indication of the noise level, and use the model to predict what the noise will be at the lower frequency components where there is both signal and noise. Alternatively, if your signal is constant in time, taking multiple FFTs at different points in time and comparing them to get a standard deviation for each frequency band can give the level of noise present.
I hope I'm not patronising you to mention the issues inherent with windowing functions when performing FFTs: these can have the effect of introducing spurious "noise" into the frequency spectrum which is in fact an artifact of the periodic nature of the FFT. There's a tradeoff between getting sharp peaks and 'sideband' noise - more here www.ee.iitm.ac.in/~nitin/_media/ee462/fftwindows.pdf
Calculate a standard deviation and then you decide the threshold that will indicate noise. In practice this is usually easy and allows you to easily tweak the "noise level" as needed.
There is a nice single pass stddev algorithm in Knuth. Here is link that describes an implementation.
Standard Deviation
calculate the signal to noise ratio
http://en.wikipedia.org/wiki/Signal-to-noise_ratio
you could also check the stdev for each point and if it's under some level you choose then the signal is good else it's not.
wouldn't the spike be
treated as a noise glitch in SNR, an
outlier to be discarded, as it were?
If it's clear from the time-domain data that there are such spikes, then they will certainly create a lot of noise in the frequency spectrum. Chosing to ignore them is a good idea, but unfortunately the FFT can't accept data with 'holes' in it where the spikes have been removed. There are two techniques to get around this. The 'dirty trick' method is to set the outlier sample to be the average of the two samples on either site, and compute the FFT with a full set of data.
The harder but more-correct method is to use a Lomb Normalised Periodogram (see the book 'Numerical Recipes' by W.H.Press et al.), which does a similar job to the FFT but can cope with missing data properly.