Does .sample() of a Tensorflow Probability model repeatedly call the model parameters? - tensorflow

I build a BNN model similar to the Keras example, which accounts for aleatory and epistemic uncertainties:
https://keras.io/examples/keras_recipes/bayesian_neural_networks/
However, I want to compute the uncertainties separately. In an outer loop I compute several predictions, each sampling new weights and biases their distribution functions. Between these predictions I compute the epistemic (model) uncertainty.
For the determination of the aleatory uncertainties I use a tfp.layers.IndependentNormal() layer, through which the model returns a distribution object. Since my data is normalized, I can't just use the .stdv() method, since the inverse transformation would not return the correct inverse normalized outputs (this would only be the case for .mean()). Instead, I need to sample from the distribution object, then inverse normalize the samples, and afterwards I can calculate my standard deviation in the initial range.
Now I'm not sure if the .sample() method e.g. for n=100 samples -> .sample(100) samples 100 times new model parameters of the distribution function of the weights and biases. That would contaminate the aleatory uncertainty with the epistemic uncertainty.
I also saw that .sample() has a seed attribute, but am not completely sure if this prevents re-sampling of the model parameter distribution function for computations on a GPU.

Related

Is there a way to retrieve the weights from a GPflow GPR model?

Is there a way to retrieve the weights from a GPflow GPR model?
I do not necessarily need the explicit weights. However, I have two issues that may be solved using the weights:
I would like to compile and send a trained model to a third party. I
would like to do this without sending the training data and without
the third party having access to the training data.
I would like to be able to predict new mean values without
calculating new variances. Currently predict_f calculates both the
mean and the variance, but I only use the mean. I believe I could
speed up my prediction significantly if I didn't calculate the
variance.
I could resolve both of these issues if I could retrieve the weights from the GPR model after training. However, if it is possible to resolve these tasks without ever dealing with explicit weights, that would be even better.
It's not entirely clear what you mean by "explicit weights", but if you mean alpha = Kxx^{-1} y where Kxx is the evaluation of k(x,x') and y is the vector of observation targets, then you can get that by using the Posterior object (see https://github.com/GPflow/GPflow/blob/develop/gpflow/posteriors.py), which you get by calling posterior = model.posterior(). You can then access posterior.alpha.
Re 1.: However, for predictions you still need to be able to compute Kzx the covariance between new test points and the training points, so you will also need to provide the training locations and kernel hyperparameters.
This also means that you cannot rely on this to keep your training data secret, as the third party could simply compute Kxx instead of Kzx and then get back y = Kxx # alpha. You can avoid sharing exact (x,y) training set pairs by using a sparse approximation (this would remove "individual identifiability" at least). But I still wouldn't rely on it for privacy.
Re 2.: The Posterior object already provides much faster predictions; if you only ask for full_cov=False (marginal variances, the default), then you're at worst about a factor ~3 or so slower than predicting just the mean (in practice, I would guesstimate less than 1.5x as slow). As of GPflow 2.3.0, there is no implementation within GPflow of predicting the mean only.

where does class_weights or weighted loss penalize the network?

I am working on a Semantic segmentation project where I have to work on multiclass data which is highly imbalanced. I searched for optimizing it during training using the model.fit parameter and in that to use class_weights or sample_weights.
I can implement a following using a class_weight dictionary as
{ 0:1, 1:10,2:15 }
I also saw a method of updating weights in loss function
But at what point do these weights get updated?
If class_weights are used where will it get penalized? I already have a kernel_regularizer for each layer so if my classes have to be penalized based on my class weights then will it penalize the output of each layer y=Wx+b or only at the final layer?
Same if I use a weighted loss function will it get penalized only on the final layer before loss calculation or on each layer and then the final loss is calculated?
Any explanation on this would be very useful.
The class_weights you mentioned in your dictionary are there to account for your imbalanced data. They will never change, they are only there to increase the penalty for misclassified instances of minority classes (that way your network pays more attention to them and the gradients returned treat one 'Class2' instance as if it was 15 times more important than one 'Class0' instance).
The kernel_regularizer you mention resides at your loss function and penalizes large weight norms for weight matrices throughout the network (if you use kernel_regularizer = tf.keras.regularizers.l1(0.01) in a Dense layer, it only affects that layer). So that is a different weight that has nothing to do with classes, only with weights inside your network. Your eventual loss will be something like loss = Cross_entropy + a * norm(Weight_matrix) and that way the network will have as an additional task assigned to it to minimize the classification loss (cross entropy) while the weight norms remain low.

Keras: Custom loss function with training data not directly related to model

I am trying to convert my CNN written with tensorflow layers to use the keras api in tensorflow (I am using the keras api provided by TF 1.x), and am having issue writing a custom loss function, to train the model.
According to this guide, when defining a loss function it expects the arguments (y_true, y_pred)
https://www.tensorflow.org/guide/keras/train_and_evaluate#custom_losses
def basic_loss_function(y_true, y_pred):
return ...
However, in every example I have seen, y_true is somehow directly related to the model (in the simple case it is the output of the network). In my problem, this is not the case. How do implement this if my loss function depends on some training data that is unrelated to the tensors of the model?
To be concrete, here is my problem:
I am trying to learn an image embedding trained on pairs of images. My training data includes image pairs and annotations of matching points between the image pairs (image coordinates). The input feature is only the image pairs, and the network is trained in a siamese configuration.
I am able to implement this successfully with tensorflow layers and train it sucesfully with tensorflow estimators.
My current implementations builds a tf Dataset from a large database of tf Records, where the features is a dictionary containing the images and arrays of matching points. Before I could easily feed these arrays of image coordinates to the loss function, but here it is unclear how to do so.
There is a hack I often use that is to calculate the loss within the model, by means of Lambda layers. (When the loss is independent from the true data, for instance, and the model doesn't really have an output to be compared)
In a functional API model:
def loss_calc(x):
loss_input_1, loss_input_2 = x #arbirtray inputs, you choose
#according to what you gave to the Lambda layer
#here you use some external data that doesn't relate to the samples
externalData = K.constant(external_numpy_data)
#calculate the loss
return the loss
Using the outputs of the model itself (the tensor(s) that are used in your loss)
loss = Lambda(loss_calc)([model_output_1, model_output_2])
Create the model outputting the loss instead of the outputs:
model = Model(inputs, loss)
Create a dummy keras loss function for compilation:
def dummy_loss(y_true, y_pred):
return y_pred #where y_pred is the loss itself, the output of the model above
model.compile(loss = dummy_loss, ....)
Use any dummy array correctly sized regarding number of samples for training, it will be ignored:
model.fit(your_inputs, np.zeros((number_of_samples,)), ...)
Another way of doing it, is using a custom training loop.
This is much more work, though.
Although you're using TF1, you can still turn eager execution on at the very beginning of your code and do stuff like it's done in TF2. (tf.enable_eager_execution())
Follow the tutorial for custom training loops: https://www.tensorflow.org/tutorials/customization/custom_training_walkthrough
Here, you calculate the gradients yourself, of any result regarding whatever you want. This means you don't need to follow Keras standards of training.
Finally, you can use the approach you suggested of model.add_loss.
In this case, you calculate the loss exaclty the same way I did in the first answer. And pass this loss tensor to add_loss.
You can probably compile a model with loss=None then (not sure), because you're going to use other losses, not the standard one.
In this case, your model's output will probably be None too, and you should fit with y=None.

Is it possible to integrate Levenberg-Marquardt optimizer from Tensorflow Graphics with a Tensorflow 2.0 model?

I have a Tensorflow 2.0 tf.keras.Sequential model. Now, my technical specification prescribes using the Levenberg-Marquardt optimizer to fit the model. Tensorflow 2.0 doesn't provide it as an optimizer out of the box, but it is available in the Tensorflow Graphics module.
tfg.math.optimizer.levenberg_marquardt.minimize function accepts residuals ( a residual is a Python callable returning a tensor) and variables (list of tensors corresponding to my model weights) as parameters.
What would be the best way to convert my model into residuals and variables?
If I understand correctly how the minimize function works, I have to provide two residuals. The first residual must call my model for every learning case and aggregate all the results into a tensor. The second residuals must return all labels as a single constant tensor. The problem is that tf.keras.Sequential.predict function returns a numpy array instead of tensor. I believe that if I convert it to a tensor, the minimizer won't be able to calculate jacobians with respect to variables.
The same problem is with variables. It doesn't seem like there's a way to extract all weights from a model into a list of tensors.
There's a major difference between tfg.math.optimizer.levenberg_marquardt.minimize and Keras optimizers from the implementation/API perspective.
Keras optimizers, such as tf.keras.optimizers.Adam consume gradients as input and updates tf.Variables.
In contrast, tfg.math.optimizer.levenberg_marquardt.minimize essentially unrolls the optimization loop in graph mode (using a tf.while_loop construct). It takes initial parameter values and produces updated parameter values, unlike Adam & co, which only apply one iteration and actually change the values of tf.Variables via assign_add.
Stepping back a bit to the theoretical big picture, Levenberg-Marquardt is not a general gradient descent-like solver for any nonlinear optimization problem (such as Adam is). It specifically addresses nonlinear least-squares optimization, so it's not a drop-in replacement for optimizers like Adam. In gradient descent, we compute the gradient of the loss with respect to the parameters. In Levenberg-Marquardt, we compute the Jacobian of the residuals with respect to the parameters. Concretely, it repeatedly solves the linearized problem Jacobian # delta_params = residuals for delta_params using tf.linalg.lstsq (which internally uses Cholesky decomposition on the Gram matrix computed from the Jacobian) and applies delta_params as the update.
Note that this lstsq operation has cubic complexity in the number of parameters, so in case of neural nets it can only be applied for fairly small ones.
Also note that Levenberg-Marquardt is usually applied as a batch algorithm, not a minibatch algorithm like SGD, though there's nothing stopping you from applying the LM iteration on different minibatches in each iteration.
I think you may only be able to get one iteration out of tfg's LM algorithm, through something like
from tensorflow_graphics.math.optimizer.levenberg_marquardt import minimize as lm_minimize
for input_batch, target_batch in dataset:
def residual_fn(trainable_params):
# do not use trainable params, it will still be at its initial value, since we only do one iteration of Levenberg Marquardt each time.
return model(input_batch) - target_batch
new_objective_value, new_params = lm_minimize(residual_fn, model.trainable_variables, max_iter=1)
for var, new_param in zip(model.trainable_variables, new_params):
var.assign(new_param)
In contrast, I believe the following naive method will not work where we assign model parameters before computing the residuals:
from tensorflow_graphics.math.optimizer.levenberg_marquardt import minimize as lm_minimize
dataset_iterator = ...
def residual_fn(params):
input_batch, target_batch = next(dataset_iterator)
for var, param in zip(model.trainable_variables, params):
var.assign(param)
return model(input_batch) - target_batch
final_objective, final_params = lm_minimize(residual_fn, model.trainable_variables, max_iter=10000)
for var, final_param in zip(model.trainable_variables, final_params):
var.assign(final_param)
The main conceptual problem is that residual_fn's output has no gradients wrt its input params, since this dependency goes through a tf.assign. But it might even fail before that due to using constructs that are disallowed in graph mode.
Overall I believe it's best to write your own LM optimizer that works on tf.Variables, since tfg.math.optimizer.levenberg_marquardt.minimize has a very different API that is not really suited for optimizing Keras model parameters since you can't directly compute model(input, parameters) - target_value without a tf.assign.

Anyway to backprob derivatives when derivatives of the custom loss function are calculated by myself

I have been using tensorflow to train deep NN acoustic models for speech recognition for a while. The loss function I use is Cross Entropy and the NN models performe very well. Now I want to change the loss function to a more complex one named MMI (Maximum Mutual Information) which is also a classical criterion used in speech recognition domain. I put one paper here which describes this loss function in case that you have interests.
When using this special loss function, the derivatives of the loss function w.r.t. the activations of output layer can be computed by some special algorithms defined in Hidden Markov Model scenario. It means that I can compute the derivatives of the loss function w.r.t. the activations of output layer by myself rather than just write out the loss function and leave Tensorflow to calculate the derivatives automatically.
But based on my poor experiences, I don't know how to backprob the derivatives which I calculate by myself. Is there any way to do this without touching Tensorflow C++ source code?
Probably yes if all the computation involved use existing tensorflow functions.
You just have to set up the chain of operations that compute the gradients from the current variables.
Then you just use tf.assign_add() to the variables with your gradients multiplied by minus the learning rate.
You are thus mimicking what happens in the background in TF usually.
EDIT: If calculations are made in numpy for instance for the gradients you can use.
#perform numpy calculations
a=f(output_npy,variables_npy)
grad_from_user=tf.placeholder(tf.float32, a.shape)
grad_update=tf.assign_add(variables_tf,-lr*grad_from_user)
#and then
sess.run(grad_update,feed_dict={grad_from_user:a,...})