How to structure multi-output bayesian optimization - optimization

I am trying to use bayesian optimization for a multi-output problem, but am not 100% sure the best way to set it up.
I have a small number of inputs (5) and outputs (3-4) in my problem. For each output, I have a target value I would like to achieve. Ultimately I would like to minimize the MSE between the target vector (of 3-4 outputs) and the true outputs.
The simplest way to do this in my mind is to create a single model which models the MSE as a function the problem inputs. Here, all historical data is first compressed into the single MSE, then this is used to train the GP.
However, I would instead like to create individual models (or a combined, multi-output model) that directly models the outputs of interest, instead of the ultimate cost function (MSE). Primarily, this is because I have noticed more accurate predictive results (of the combined MSE), when first modeling the individual outputs, then creating a MSE, instead of directly modeling the MSE.
My problem arises when creating the acquisition function when I have multiple outputs. Ideally, I'd like to use expected improvement (EI) as my acquisition function. However, I'm not sure how to either 1) combine the multiple output distributions into a single distribution (representing the probability of the combined MSE), which can then be used to determine overall EI or 2) how to combine multiple EI values into a single metric (i.e. combine the E.I. for each output into a unified E.I.).
When reading about multi-output BO, the most common approach seems to be to identify a frontier of solutions, however this is not 100% applicable, as ultimately I can convert the output vector into a single MSE (and the frontier becomes a point).
Is the best approach simply to model the combined MSE directly? Or is there a way that I can model the individual outputs, then combine these modeled outputs into a reasonable acquisition function?

Related

How to implement feature importance on nominal categorical features in tree based classifiers?

I am using SKLearn XGBoost model for my binary classification problem. My data contains nominal categorical features (such as race) for which one hot encoding should be used to feed them to the tree based models.
On the other hand, using feature_importances_ variable of XGBoost yields us the importance of each column on the trained model. So if I do the encoding and then get the features importance of columns, the result will includes names like race_2 and its importance.
What should I do to solve this problem and get a whole score for each nominal feature? Can I take the average of one hot encoded columns importance scores that belong to one feature? (like race_1, race_2 and race_3)
First of all, if your goal is to select the most useful features for later training, I would advise you to use regularization in your model. In the case of xgboost, you can tune the parameter gamma so the model would actually be more dependent on "more useful" features (i.e. tune the minimum loss reduction required for the model to add a partition leaf). Here is a good article on implementing regularization into xgboost models.
On the other hand, if you insist on doing feature importance, I would say grouping the encoded variables and simply adding them is not a good decision. This would result in feature-importance results that do not consider the relationship between these dummy variables.
My suggestion would be to take a look at the permutation tools for this. The basic idea is you take your original dataset, shuffle the values on the column in which you are going to calculate feature importance, train the model and record the score. Repeat this over different columns and the effect of each on the model performance would be a sign of their importance.
It is actually easier done than said, sklearn has this feature built-in to do for you: check out the example provided in here.

Multiple BERT binary classifications on a single graph to save on inference time

I have five classes and I want to compare four of them against one and the same class. This isn't a One vs Rest classifier, as for each output I want to score them against one base class.
The four outputs should be: base class vs classA, base class vs classB, etc.
I could do this by having multiple binary classification tasks, but that's wasting computation time if the first layers are BERT preprocessing + pretrained BERT layers, and the only differences between the four classifiers are the last few layers of BERT (finetuned ones) and the Dense layer.
So why not merge the graphs for more performance?
My inputs are four different datasets, each annotated with true/false for each class.
As I understand it, I can re-use most of the pipeline (BERT preprocessing and the first layers of BERT), as those have shared weights. I should then be able to train the last few layers of BERT and the Dense layer on top differently depending on the branch of the classifier (maybe using something like keras.switch?).
I have tried many alternative options including multi-class and multi-label classifiers, with actual and generated (eg, machine-annotated) labels in the case of multiple input labels, different activation and loss functions, but none of the results were acceptable to me (none were as good as the four separate models).
Is there a solution for merging the four different models for more performance, or am I stuck with using 4x binary classifiers?
When you train DNN for specific task it will be (in vast majority of cases) be better than the more general model that can handle several task simultaneously. Saying that, based on my experience the properly trained general model produces very similar results to the original binary ones. Anyways, here couple of suggestions for training strategies (assuming your training datasets for each task are completely different):
Weak supervision approach
Train your binary classifiers, and label your datasets using them (i.e. label with binary classifier trained on dataset 2 datasets [1,3,4]). Then train your joint model as multilabel task using all the newly labeled datasets (don't forget to randomize samples before feeding them to trainer ;) ). Here you will need to experiment if you will use threshold and set a label to 0/1 or use the scores of the binary classifiers.
Create custom loss function that will not penalize if no information provided for certain class. So when your will introduce sample from (say) dataset 2, your loss will be calculated only for the 2nd class.
Of course you can apply both simultaneously. For example, if you know that binary classifier produces scores that are polarized (most results are near 0 or 1), you can use weak labels, and automatically label your data with scores. Now during the second stage penalize loss such that for score x' = 4(x-0.5)^2 (note that you get logits from the model, so you will need to apply sigmoid function). This way you will increase contribution of the samples binary classifier is confident about, and reduce that of less certain ones.
As for releasing last layers of BERT, usually unfreezing upper 3-6 layers is enough. Releasing more layers improves results very little and increases time and memory requirements.

Is there a way to retrieve the weights from a GPflow GPR model?

Is there a way to retrieve the weights from a GPflow GPR model?
I do not necessarily need the explicit weights. However, I have two issues that may be solved using the weights:
I would like to compile and send a trained model to a third party. I
would like to do this without sending the training data and without
the third party having access to the training data.
I would like to be able to predict new mean values without
calculating new variances. Currently predict_f calculates both the
mean and the variance, but I only use the mean. I believe I could
speed up my prediction significantly if I didn't calculate the
variance.
I could resolve both of these issues if I could retrieve the weights from the GPR model after training. However, if it is possible to resolve these tasks without ever dealing with explicit weights, that would be even better.
It's not entirely clear what you mean by "explicit weights", but if you mean alpha = Kxx^{-1} y where Kxx is the evaluation of k(x,x') and y is the vector of observation targets, then you can get that by using the Posterior object (see https://github.com/GPflow/GPflow/blob/develop/gpflow/posteriors.py), which you get by calling posterior = model.posterior(). You can then access posterior.alpha.
Re 1.: However, for predictions you still need to be able to compute Kzx the covariance between new test points and the training points, so you will also need to provide the training locations and kernel hyperparameters.
This also means that you cannot rely on this to keep your training data secret, as the third party could simply compute Kxx instead of Kzx and then get back y = Kxx # alpha. You can avoid sharing exact (x,y) training set pairs by using a sparse approximation (this would remove "individual identifiability" at least). But I still wouldn't rely on it for privacy.
Re 2.: The Posterior object already provides much faster predictions; if you only ask for full_cov=False (marginal variances, the default), then you're at worst about a factor ~3 or so slower than predicting just the mean (in practice, I would guesstimate less than 1.5x as slow). As of GPflow 2.3.0, there is no implementation within GPflow of predicting the mean only.

Predict all probable trajectories in a grid structure using Keras

I'm trying to predict sequences of 2D coordinates. But I don't want only the most probable future path but all the most probable paths to visualize it in a grid map.
For this I have traning data consisting of 40000 sequences. Each sequence consists of 10 2D coordinate pairs as input and 6 2D coordinate pairs as labels.
All the coordinates are in a fixed value range.
What would be my first step to predict all the probable paths? To get all probable paths I have to apply a softmax in the end, where each cell in the grid is one class right? But how to process the data to reflect this grid like structure? Any ideas?
A softmax activation won't do the trick I'm afraid; if you have an infinite number of combinations, or even a finite number of combinations that do not already appear in your data, there is no way to turn this into a multi-class classification problem (or if you do, you'll have loss of generality).
The only way forward I can think of is a recurrent model employing variational encoding. To begin with, you have a lot of annotated data, which is good news; a recurrent network fed with a sequence X (10,2,) will definitely be able to predict a sequence Y (6,2,). But since you want not just one but rather all probable sequences, this won't suffice. Your implicit assumption here is that there is some probability space hidden behind your sequences, which affects how they play out over time; so to model the sequences properly, you need to model that latent probability space. A Variational Auto-Encoder (VAE) does just that; it learns the latent space, so that during inference the output prediction depends on sampling over that latent space. Multiple predictions over the same input can then result in different outputs, meaning that you can finally sample your predictions to empirically approximate the distribution of potential outputs.
Unfortunately, VAEs can't really be explained within a single paragraph over stackoverflow, and even if they could I wouldn't be the most qualified person to attempt it. Try searching the web for LSTM-VAE and arm yourself with patience; you'll probably need to do some studying but it's definitely worth it. It might also be a good idea to look into Pyro or Edward, which are probabilistic network libraries for python, better suited to the task at hand than Keras.

Caching Computations in TensorFlow

Is there a canonical way to reuse computations from a previously-supplied placeholder in TensorFlow? My specific use case:
supply many inputs (using one placeholder) simultaneously, all of which are fed through a network to obtain smaller representations
define a loss based on various combinations of these smaller representations
train on one batch at a time, where each batch uses some subset of the inputs, without recomputing the smaller representations
Here is the goal in code, but which is defective because the same computations are carried out again and again:
X_in = some_fixed_data
combinations_in = large_set_of_combination_indices
for combination_batch_in in batches(combinations_in, batch_size=128):
session.run(train_op, feed_dict={X: X_in, combinations: combination_batch_in})
Thanks.
The canonical way to share computed values across sess.Run() calls is to use a Variable. In this case, you could set up your graph so that when the Placeholders are fed, they compute a new value of the representation that is saved into a Variable. A separate portion of the graph reads those Variables to compute the loss. This will not work if you need to compute gradients through the part of the graph that computes the representation. Computing those gradients will require recomputing every Op in the encoder.
This is the kind of thing that should be solved automatically with CSE (common subexpression elimination). Not sure what the support in TensorFlow right now, might be kind of spotty, but there's optimizer_do_cse flag for Graph options which is defaulting to false, and you can set it to true using GraphConstructorOptions. Here's a C++ example of using GraphConstructorOptions (sorry, couldn't find a Python one)
If that doesn't work, you could do "manual CSE", ie, figure out which part is being needlessly recomputed, factor it out into separate Tensor, and reference that tensor in all the calculations.