When training a neural network to minimize some loss L, we are finding the parameters that minimize that loss given a set of training data.
What methods do we have of minimizing that loss subject to some inequality constraints?
Say the input is x, the output is y(x, p), and the parameters are p. Is there a way to train it from the following statement?
argmin_p L
subject to
g(x,y(x,p)) < 0
I've googled around, and what I'm trying to do sounds pretty similar to NLP (nonlinear programming), just without an equality constraint, in that it's an inequality-constrained minimization problem. I've found TensorFlow Constrained Optimization (TFCO), but I'm not seeing neural networks being trained in any of its examples.
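A common practical workaround, short of TFCO's full machinery, is to fold the constraint into the loss as a soft penalty such as L + lambda * relu(g)^2 and, if needed, increase lambda over training. Below is a minimal sketch in TF 2.x style; the model, task_loss and g shown are placeholders for your own network, loss and constraint function, not anything from TFCO.

import tensorflow as tf

# Hypothetical pieces: `model` is y(x, p), `task_loss` is L, and `g` is the constraint.
model = tf.keras.Sequential([tf.keras.layers.Dense(64, activation="relu"),
                             tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.Adam(1e-3)
penalty_weight = 10.0  # can be increased over training for a crude penalty schedule

def task_loss(y_true, y_pred):
    return tf.reduce_mean(tf.square(y_true - y_pred))

def g(x, y_pred):
    # example constraint g(x, y) <= 0; replace with your own
    return y_pred - 1.0

@tf.function
def train_step(x, y_true):
    with tf.GradientTape() as tape:
        y_pred = model(x, training=True)
        violation = tf.nn.relu(g(x, y_pred))  # only positive (violating) values are penalized
        loss = task_loss(y_true, y_pred) + penalty_weight * tf.reduce_mean(tf.square(violation))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

TFCO itself uses a Lagrangian-style formulation in which the multipliers are adapted during training; the soft penalty above is only the simplest baseline and does not guarantee the constraint is satisfied exactly.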
Related
I'm evaluating the possibility of introducing a new loss for the problem described above.
Let N be the number of examples, C the number of classes, p_ij the classifier output for class j on example i, and y_ij the binary indicator (0 or 1) of whether class label j is the correct classification for example i. The loss would be built from the usual cross-entropy terms -SUM_j y_ij log p_ij, with the individual cross-entropy losses correlated by a positive semi-definite similarity matrix S.
Is there any method to write this custom loss in a deep learning framework (e.g. TensorFlow) and have backpropagation driven by it?
As the feed-forward/backpropagation cycle advances, new patterns are processed by the neural network, so new error terms e_1, ..., e_x are determined, where x is the number of feed-forward steps taken so far. It should then be possible to backpropagate using a loss that combines these x error terms.
If you can help me with this, we can be co-authors of a paper on the subject.
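(For what it's worth, any such loss written with differentiable TensorFlow ops gets gradients, and hence backpropagation, for free. A rough sketch under the assumption that the intended coupling is the quadratic form L = ell^T S ell over the per-class cross-entropy terms; the exact formula intended above may differ.)

import tensorflow as tf

# Assumption: ell_j = -(1/N) * SUM_i y_ij * log p_ij and L = ell^T S ell,
# with S a fixed positive semi-definite (C x C) similarity matrix.
def correlated_cross_entropy(y_true, y_pred, S):
    # y_true: (N, C) one-hot labels, y_pred: (N, C) predicted probabilities
    eps = 1e-7
    per_class_ce = -tf.reduce_mean(y_true * tf.math.log(y_pred + eps), axis=0)  # shape (C,)
    ell = tf.reshape(per_class_ce, (-1, 1))                                     # shape (C, 1)
    return tf.squeeze(tf.matmul(tf.matmul(ell, S, transpose_a=True), ell))      # scalar loss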
Suppose I have vectors of dimension 1 x N, {X_1 ... X_n} and {X_1' ... X_n'}, where each X and X' are related but the relation cannot be modeled by a function. I want to train a neural network that takes X_i as input and outputs Y_i with dimension N x 1, such that norm((X_i')(Y_i)) is maximized. The constraint is that Y_i has a norm of 1 (otherwise the network would just make the entries of Y_i as large as possible).
I do not use X_i' as the inputs because they are not available in real life. I hope that when I test the neural network by feeding it {X_n+1 ... X_k}, it will output {Y_n+1 ... Y_k} such that norm((X_n+1')(Y_n+1)) is maximized. Again, note that I have {X_n+1' ... X_k'} only when testing, not in real life where the neural network will be used.
I tried defining custom TensorFlow or Keras loss functions, but they don't seem to work. I also tried using a neural network to first predict X_i' from X_i, but the performance is not very good.
A difficulty here is to define a loss function that has no labels and to make the neural network backpropagate through it. Any ideas how this may be achieved?
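One pattern that may help: a Keras custom loss does not need real labels; you can pass X_i' through the y_true slot (it is only needed at training time), and enforce ||Y_i|| = 1 exactly by normalizing inside the model rather than penalizing. A minimal sketch (layer sizes and names are placeholders):

import tensorflow as tf

N = 32  # dimension of each X_i (assumption)

inputs = tf.keras.Input(shape=(N,))
h = tf.keras.layers.Dense(128, activation="relu")(inputs)
y = tf.keras.layers.Dense(N)(h)
y_unit = tf.keras.layers.Lambda(lambda t: tf.math.l2_normalize(t, axis=-1))(y)  # ||Y_i|| = 1
model = tf.keras.Model(inputs, y_unit)

def negative_projection_loss(x_prime, y_pred):
    # x_prime rides in the "y_true" slot; it is X_i', available only during training.
    proj = tf.reduce_sum(x_prime * y_pred, axis=-1)  # (X_i')(Y_i) for a 1xN times Nx1 product
    return -tf.reduce_mean(tf.abs(proj))             # maximize |projection| = minimize its negative

model.compile(optimizer="adam", loss=negative_projection_loss)
# model.fit(X, X_prime, ...)  # X_prime is needed only for training, never at deployment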
I have been trying to conduct a few experiments using TensorFlow Probability (TFP), and I have a few questions.
What is the proper value of the coefficient of the KL loss?
In the paper by Blundell (2015), the coefficient is set to 1/M (where M is the number of mini-batches). In the example given by TFP, the coefficient is given as 1/mnist_data.train.num_examples. Why?
As I go from 2D inputs to 3D image volumes, the KL loss is still significantly larger (~1k) than the cross-entropy (~1), even after dividing by mnist_data.train.num_examples. Why?
What is the guideline for choosing a proper value for this coefficient? Perhaps the two loss terms should be of the same order of magnitude?
The current coefficient only accounts for the number of training samples, not for the network complexity or the number of parameters, even though I assume the KL loss increases with the complexity of the model.
I am trying to implement a neural network with the KL loss without using keras.model.losses, due to some software production and hardware support limitations. I am trying to train my model with TF 1.10 and TFP 0.3.0; the issue is that for tf<=1.14, tf.keras.Model does not support tf.layers inside the Keras model, so I can't use my original model straight away. Is there a way to get the KL loss not from model.losses, but from the layers or weights of the network, in a TF construct?
Is batch normalization or group normalization still helpful in Bayesian deep learning?
In the paper by Blundell (2015), the coefficient is set to 1/M (where M is the number of mini-batches). In the example given by TFP, the coefficient is given as 1/mnist_data.train.num_examples. Why?
In the BBB paper, eq. 8, they refer to M as the number of mini-batches. To be consistent with non-stochastic gradient learning, it should be scaled by the number of mini-batches, which is what is done by Graves. Another alternative is the scheme in eq. 9, where they scale it by \pi_i, where all the values in the set {\pi} sum to one.
In the TFP example, it does look like num_examples is the total number of independent samples within the training set, which is much larger than the number of batches. This goes by a few names, such as Safe Bayes or Tempering. Have a look at sec. 8 of this paper for some more discussion about the use of tempering within Bayesian inference and its suitability.
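In code the two conventions differ only in what the summed KL term is divided by; a minimal sketch, assuming a model whose probabilistic layers accumulate their KL terms into model.losses (as in the TFP examples), with neg_log_likelihood, num_batches and num_examples defined elsewhere:

import tensorflow as tf

kl = tf.add_n(model.losses)  # sum of the per-layer KL(q || p) terms (hypothetical `model`)

# Blundell/Graves-style scaling: divide by the number of mini-batches M.
loss_minibatch_scaled = neg_log_likelihood + kl / num_batches

# TFP-example-style scaling (a form of tempering): divide by the number of training examples.
loss_example_scaled = neg_log_likelihood + kl / num_examples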
As I go from 2D inputs to 3D image volumes, the KL loss is still significantly larger (~1k) than the cross-entropy (~1), even after dividing by mnist_data.train.num_examples. Why?
The ELBO will always be larger than just your cross-entropy (which defines your likelihood). Have a look at how the KL divergence term in the ELBO is derived under a full mean-field approach, where each weight/parameter is assumed to be independent.
Since the assumed posterior is factorised (each parameter is assumed independent), we can write the joint distribution as a product. This means that, when you take the log while computing the KL between the approximate posterior and the prior, you can write it as a sum of KL terms, one per parameter. Since each KL is >= 0, every parameter you add to your model adds another non-negative term to your ELBO loss. This is likely why the loss is so much larger for your 3D model: it has more parameters.
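Spelled out, the factorised posterior and prior turn the KL term into a per-parameter sum of non-negative contributions:

KL( q(w) || p(w) ) = SUM_i KL( q(w_i) || p(w_i) ), with each term >= 0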
Another reason this could occur is if you have less data (if your M is smaller, the KL term is divided by a smaller number, so it is scaled down less relative to the likelihood term).
What is the guideline for choosing a proper value for this coefficient? Perhaps the two loss terms should be of the same order of magnitude?
I am unsure of any specific guideline; for training you are primarily interested in the gradients. A large loss does not mean a large gradient. Have a look at the gradients contributed by the negative log-likelihood and by the KL term in your ELBO. If the KL term is too large, you probably need a more informative prior or more data (you could simply scale down the KL term, but this feels a bit yucky for the Bayesian in me).
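A quick way to check this is to compare the gradient norms contributed by each term separately; a rough sketch in TF 2.x eager style (model, neg_log_likelihood, num_examples and the data x, y are placeholders):

import tensorflow as tf

with tf.GradientTape(persistent=True) as tape:
    nll = neg_log_likelihood(y, model(x))        # data-fit term
    kl = tf.add_n(model.losses) / num_examples   # scaled KL term
grads_nll = [g for g in tape.gradient(nll, model.trainable_variables) if g is not None]
grads_kl = [g for g in tape.gradient(kl, model.trainable_variables) if g is not None]
print(tf.linalg.global_norm(grads_nll), tf.linalg.global_norm(grads_kl))
del tape  # release the persistent tape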
The current coefficient only accounts for the number of training samples, not for the network complexity or the number of parameters, even though I assume the KL loss increases with the complexity of the model.
Yes, as stated before: in general, more parameters means a larger KL term in the ELBO loss (for a mean-field approach as used in Bayes by Backprop).
I am trying to implement a neural network with the KL loss without using keras.model.losses, due to some software production and hardware support limitations. I am trying to train my model with TF 1.10 and TFP 0.3.0; the issue is that for tf<=1.14, tf.keras.Model does not support tf.layers inside the Keras model, so I can't use my original model straight away. Is there a way to get the KL loss not from model.losses, but from the layers or weights of the network, in a TF construct?
I am unsure about the best way to tackle this part of it. I would be cautious about going to older versions where it isn't explicitly supported. They put those warnings/exceptions in for a reason.
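One thing that may be worth trying, if the graph is built directly from tfp.layers objects, is to read the posterior/prior distributions off each layer and sum the KL terms yourself. A rough sketch; the attribute names (kernel_posterior, kernel_prior) are those of current tfp.layers.DenseFlipout-style layers and may differ in TFP 0.3.0:

import tensorflow as tf
import tensorflow_probability as tfp

# `bayesian_layers` is a hypothetical list holding the tfp.layers objects used in the graph.
kl_terms = []
for layer in bayesian_layers:
    kl = tfp.distributions.kl_divergence(layer.kernel_posterior, layer.kernel_prior)
    kl_terms.append(tf.reduce_sum(kl))  # reduce in case the KL is returned per-slice
kl_loss = tf.add_n(kl_terms)  # same quantity model.losses would have collected

Each TFP layer should also expose its own losses property even without a surrounding tf.keras.Model, so tf.add_n([l for layer in bayesian_layers for l in layer.losses]) may be an even simpler route, if it is available in your version.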
Is batch normalization or group normalization still helpful in Bayesian deep learning?
For variational inference (as done in Bayes by Backprop), batch norm is fine. For sampling methods such as MCMC, batch normalization is no longer suitable. Have a look at https://arxiv.org/pdf/1908.03491v1.pdf for info on the suitability of batch norm with sampling methods for approximate Bayesian inference.
The standard supervised classification setup: we have a bunch of samples, each with the correct label out of N labels. We build a NN with N outputs, transform those to probabilities with softmax, and the loss is the mean cross-entropy between each NN output and the corresponding true label, represented as a 1-hot vector with 1 in the true label and 0 elsewhere. We then optimize this loss by following its gradient. The classification error is used just to measure our model quality.
HOWEVER, I know that when doing policy gradient we can use the likelihood ratio trick, and we no longer need to use cross-entropy! Our loss simply tf.gathers the NN output corresponding to the correct label. E.g. this solution for OpenAI Gym CartPole.
WHY can't we use the same trick when doing supervised learning? I was thinking that the reason we used cross-entropy is because it is differentiable, but apparently tf.gather is differentiable as well.
I mean - IF we measure ourselves on classification error, and we CAN optimize for classification error as it's differentiable, isn't it BETTER to also optimize for classification error instead of this weird cross-entropy proxy?
Policy gradient is using cross-entropy (or KL divergence, as Ishant pointed out). For supervised learning, tf.gather is really just an implementation trick, nothing else. For RL, on the other hand, it is a must because you do not know "what would happen" if you were to execute another action. Consequently you end up with a high-variance estimator of your gradients, something you would like to avoid at all costs, if possible.
Going back to supervised learning though
CE(p||q) = - SUM_i q_i log p_i
Let's assume that q is one-hot encoded, with a 1 at the k-th position; then
CE(p||q) = - q_k log p_k = - log p_k
So if you want, you can implement this as tf.gather; it simply does not matter. Cross-entropy is simply more generic because it handles more complex targets. In particular, in TF you have sparse cross-entropy, which does exactly what you describe: it exploits the one-hot encoding, that's it. Mathematically there is no difference, there is a small difference computation-wise, and there are functions doing exactly what you want.
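Concretely, for hard (one-hot / integer) targets the three formulations below compute the same per-example value, which is why the gather version is just an implementation detail; a small TF 2.x sketch:

import tensorflow as tf

logits = tf.constant([[2.0, 0.5, -1.0],
                      [0.1, 1.2,  0.3]])
labels = tf.constant([0, 1])

log_probs = tf.nn.log_softmax(logits)

# 1) "gather" the log-probability of the correct class
gathered = -tf.gather(log_probs, labels, batch_dims=1)

# 2) sparse cross-entropy on integer labels
sparse_ce = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits)

# 3) dense cross-entropy on one-hot labels
dense_ce = tf.nn.softmax_cross_entropy_with_logits(labels=tf.one_hot(labels, depth=3), logits=logits)

# gathered, sparse_ce and dense_ce agree up to numerical precision.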
Minimization of the cross-entropy loss minimizes the KL divergence between the predicted distribution and the target distribution, which is indeed the same as maximizing the likelihood of the targets under the predicted distribution.
I'm trying to train a convolutional neural network with triplet loss (more about triplet loss here) in order to generate face embeddings (128 values that accurately describe a face).
In order to select only semi-hard triplets (distance(anchor, positive) < distance(anchor, negative)), I first feed a whole mini-batch through the network and calculate the distances:
distance1, distance2 = sess.run([d_pos, d_neg], feed_dict={x_anchor:input1, x_positive:input2, x_negative:input3})
Then I select the indices of the inputs with distances that respect the formula above:
valids_batch = compute_valids(distance1, distance2, batch_size)
The function compute_valids:
def compute_valids(distance1, distance2, batch_size):
    valids = []
    for q in range(len(distance1)):
        if distance1[q] < distance2[q]:
            valids.append(q)
    return valids
Then I learn only from the training examples with indices returned by this filter function:
input1_valid = [input1[q] for q in valids_batch]
input2_valid = [input2[q] for q in valids_batch]
input3_valid = [input3[q] for q in valids_batch]
_, loss_value, summary = sess.run([optimizer, cost, summary_op], feed_dict={x_anchor:input1_valid, x_positive:input2_valid, x_negative:input3_valid})
Where optimizer is defined as:
model1 = siamese_convnet(x_anchor)
model2 = siamese_convnet(x_positive)
model3 = siamese_convnet(x_negative)
d_pos = tf.reduce_sum(tf.square(model1 - model2), 1)
d_neg = tf.reduce_sum(tf.square(model1 - model3), 1)
cost = triplet_loss(d_pos, d_neg)
optimizer = tf.train.AdamOptimizer(learning_rate = 1e-4).minimize( cost )
But something is wrong because accuracy is very low (50%).
What am I doing wrong?
There are a lot of reasons why your network is performing poorly. From what I understand, your triplet generation method is fine. Here are some tips that may help improve your performance.
The model
In deep metric learning, people usually use models pre-trained on the ImageNet classification task, as these models are quite expressive and can generate good representations for images. You can fine-tune your model on the basis of these pre-trained models, e.g. VGG16, GoogLeNet, ResNet.
How to fine-tune
Even if you have a good pre-trained model, it is often difficult to directly optimize the triplet loss using these models on your own dataset. Since these pre-trained models were trained on ImageNet, if your dataset is vastly different from ImageNet, you can first fine-tune the model on a classification task on your dataset. Once your model performs reasonably well on the classification task on your custom dataset, you can use the classification model as the base network (maybe with a little tweaking) for the triplet network. This will often lead to much better performance.
Hyperparameters
Hyperparameters such as the learning rate, momentum, weight decay, etc. are also extremely important for good performance (the learning rate may be the most important factor). Since you are fine-tuning and not training the network from scratch, you should use a small learning rate, for example lr=0.001 or lr=0.0001. For momentum, 0.9 is a good choice. For weight decay, people usually use 0.0005 or 0.00005.
If you add some fully connected layers, then for these layers the learning rate may be higher than for the other layers (0.01 for example).
Which layers to fine-tune
As your network has several layers, you need to decide which layers to fine-tune. Researchers have found that the lower layers of a network produce generic features such as lines or edges. Typically, people freeze the lower layers and only update the weights of the upper layers, which tend to produce task-oriented features. You should try optimizing starting from different layers and see which setting performs best.
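In Keras this typically amounts to marking the lower layers as non-trainable before compiling; a rough sketch assuming a VGG16 base (the number of layers to freeze is a placeholder to experiment with):

import tensorflow as tf

base = tf.keras.applications.VGG16(weights="imagenet", include_top=False, pooling="avg")

# Freeze the lower, generic-feature layers; fine-tune only the last few.
for layer in base.layers[:-4]:
    layer.trainable = False

embedding = tf.keras.layers.Dense(128)(base.output)
embedding = tf.keras.layers.Lambda(lambda t: tf.math.l2_normalize(t, axis=-1))(embedding)
model = tf.keras.Model(base.input, embedding)
# Compile with your triplet loss and a small learning rate (e.g. Adam with 1e-4).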
Reference
Fast R-CNN (Section 4.5, which layers to fine-tune)
Deep Image Retrieval (Section 5.2, influence of fine-tuning the representation)
distance(anchor, positive) < distance(anchor, negative)
This will select triplets in which the similarity between anchor and positive is higher than between anchor and negative, which is the opposite of a hard triplet. You need examples where d(a,p) > d(a,n) for hard triplets. For semi-hard triplets, you need examples that satisfy d(a,p) < d(a,n) < d(a,p) + margin.
Here is the explanation : https://stackoverflow.com/a/49314187/7693521
I hope I am correct about this, if not please correct me.
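In NumPy terms, reusing the distance1/distance2 arrays from the question, the two selection rules differ only in the mask; the margin below is an assumption and should match the one used in your triplet loss:

import numpy as np

margin = 0.2  # assumption: same margin as in the triplet loss

d_ap = np.asarray(distance1)  # d(anchor, positive) per triplet
d_an = np.asarray(distance2)  # d(anchor, negative) per triplet

hard_idx = np.where(d_ap > d_an)[0]                                  # hard triplets
semi_hard_idx = np.where((d_ap < d_an) & (d_an < d_ap + margin))[0]  # semi-hard triplets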