How to minimize two losses using TensorFlow?

I am working on a project to localize an object in an image. The method I am going to adopt is based on the localization algorithm in CS231n lecture 8.
The network structure has two optimization heads, a classification head and a regression head. How can I minimize both of them when training the network?
One idea I have is to sum both of them into one loss. But the problem is that the classification loss is a softmax (cross-entropy) loss and the regression loss is an L2 loss, which means they have different ranges. I don't think this is the best way.
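For reference, the summed-loss idea looks roughly like this (a minimal sketch; `alpha` is a hypothetical balancing weight you would have to tune because of the differing ranges):

```python
import tensorflow as tf

def combined_loss(cls_logits, cls_labels, box_pred, box_target, alpha=1.0):
    # Softmax cross-entropy for the classification head.
    cls_loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=cls_labels, logits=cls_logits))
    # L2 loss for the regression (localization) head.
    reg_loss = tf.reduce_mean(tf.square(box_pred - box_target))
    # alpha balances the two terms, since their magnitudes differ.
    return cls_loss + alpha * reg_loss
```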

It depends on the state of your network.
If your network is already able to extract features (you're reusing weights kept from some other net), you can set those weights to be constants and then train the two heads separately, since the gradient will not flow through the constants.
If you're not using weights from a pre-trained model, you have to:
1. Train the network to extract features: train the network using the classification head and let the gradient flow from the classification head down to the first convolutional filters. In this way your network can now classify objects by combining the extracted features.
2. Convert the learned weights of the convolutional filters and the classification head to constant tensors and train the regression head. The regression head will learn to combine the features extracted by the convolutional layers, adapting its parameters in order to minimize the L2 loss.
Tl;dr:
Train the network for classification first.
Convert every learned parameter to a constant tensor, using graph_util.convert_variables_to_constants as shown in the freeze_graph script.
Train the regression head.
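A rough sketch of the same two-stage recipe, using Keras layer freezing (trainable = False) instead of graph freezing; the architecture and layer sizes below are purely illustrative:

```python
import tensorflow as tf
from tensorflow import keras

# Illustrative shared backbone; a real localization network will differ.
backbone = keras.Sequential([
    keras.layers.Conv2D(32, 3, activation="relu", input_shape=(64, 64, 3)),
    keras.layers.MaxPooling2D(),
    keras.layers.Flatten(),
], name="backbone")

inputs = keras.Input(shape=(64, 64, 3))
features = backbone(inputs)
cls_out = keras.layers.Dense(10, activation="softmax", name="cls")(features)
reg_out = keras.layers.Dense(4, name="reg")(features)  # e.g. box coordinates

# Stage 1: train the backbone together with the classification head.
cls_model = keras.Model(inputs, cls_out)
cls_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# cls_model.fit(images, class_labels, ...)

# Stage 2: freeze the backbone so only the regression head keeps learning.
backbone.trainable = False
reg_model = keras.Model(inputs, reg_out)
reg_model.compile(optimizer="adam", loss="mse")
# reg_model.fit(images, box_targets, ...)
```

Freezing the backbone has the same effect as the constant-tensor approach above: no gradient reaches the feature extractor while the regression head minimizes its L2 loss.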

Related

Using multiple losses and multiple training steps in a TF2 model using subclassing?

I am implementing a generative adversarial autoencoder in TF2. I have got it working, but not optimally, and could use some high-level advice to improve it.
The model is inspired by the paper “Adversarial Factorization Autoencoder for Look-alike Modeling” https://dmkd.cs.vt.edu/papers/CIKM19.pdf
The model consists of three parts: an encoder/generator, a decoder and a discriminator.
I have implemented these three parts as custom classes, each subclassing from tf.keras.Model (which is the primary source of my issues). I have three different loss functions (autoencoder loss, generator loss, discriminator loss) and two custom training step functions.
The first training function first trains the autoencoder (using the autoencoder loss function) and then the generator (using the generator loss function). The generator is just the encoder part of the autoencoder, but it has the double purpose of also fooling the discriminator.
The second training function trains the discriminator using the discriminator loss function.
This approach works OK, but subclassing all three parts from tf.keras.Model has limitations. I can't utilize Keras's compile and fit functionality, callbacks are a nightmare, and I really do need early stopping, keeping the best model, TensorBoard integration and so on. It appears the best approach is to subclass each part from tf.keras.layers.Layer and then combine them in a single custom model. But I am not sure whether it is at all possible to wire up multiple loss functions and multiple training steps to different layer blocks of a custom model.
Any hints and insights are greatly appreciated.
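One possible wiring, sketched under the assumption that the three parts stay separate Keras sub-models owned by a single tf.keras.Model whose overridden train_step runs the three updates; the names and loss details below are illustrative and not taken from the paper:

```python
import tensorflow as tf
from tensorflow import keras

class AAE(keras.Model):
    """Illustrative adversarial autoencoder with one custom train_step."""

    def __init__(self, encoder, decoder, discriminator, **kwargs):
        super().__init__(**kwargs)
        self.encoder = encoder              # any keras.Model / Layer
        self.decoder = decoder
        self.discriminator = discriminator
        self.bce = keras.losses.BinaryCrossentropy(from_logits=True)

    def compile(self, ae_opt, gen_opt, disc_opt, **kwargs):
        super().compile(**kwargs)
        self.ae_opt, self.gen_opt, self.disc_opt = ae_opt, gen_opt, disc_opt

    def train_step(self, x):
        # 1) Autoencoder step: update encoder + decoder on the reconstruction loss.
        with tf.GradientTape() as tape:
            z = self.encoder(x, training=True)
            recon = self.decoder(z, training=True)
            ae_loss = tf.reduce_mean(tf.square(x - recon))
        ae_vars = self.encoder.trainable_variables + self.decoder.trainable_variables
        self.ae_opt.apply_gradients(zip(tape.gradient(ae_loss, ae_vars), ae_vars))

        # 2) Discriminator step: tell prior samples apart from encoded samples.
        with tf.GradientTape() as tape:
            z = self.encoder(x, training=True)
            d_real = self.discriminator(tf.random.normal(tf.shape(z)), training=True)
            d_fake = self.discriminator(z, training=True)
            disc_loss = (self.bce(tf.ones_like(d_real), d_real)
                         + self.bce(tf.zeros_like(d_fake), d_fake))
        d_vars = self.discriminator.trainable_variables
        self.disc_opt.apply_gradients(zip(tape.gradient(disc_loss, d_vars), d_vars))

        # 3) Generator step: the encoder tries to fool the discriminator.
        with tf.GradientTape() as tape:
            d_fake = self.discriminator(self.encoder(x, training=True), training=True)
            gen_loss = self.bce(tf.ones_like(d_fake), d_fake)
        g_vars = self.encoder.trainable_variables
        self.gen_opt.apply_gradients(zip(tape.gradient(gen_loss, g_vars), g_vars))

        return {"ae_loss": ae_loss, "disc_loss": disc_loss, "gen_loss": gen_loss}
```

Because fit() drives the overridden train_step, standard callbacks (EarlyStopping, ModelCheckpoint, TensorBoard) should work as usual.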

How to understand LSTM performance generated on a graph?

I used an LSTM with Keras for the task of Named Entity Recognition. The model achieves around 95% accuracy. Then I tried to create the graphs for accuracy and loss. Now I don't know: what do these two graphs represent?
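For context, such curves are usually drawn from the History object that Keras's model.fit returns; a minimal sketch (it assumes `history` comes from a finished training run with validation data):

```python
import matplotlib.pyplot as plt

def plot_history(history):
    """Plot training vs. validation accuracy and loss over epochs."""
    for metric in ("accuracy", "loss"):  # keys may be "acc"/"val_acc" in older Keras
        plt.figure()
        plt.plot(history.history[metric], label=f"train {metric}")
        plt.plot(history.history[f"val_{metric}"], label=f"val {metric}")
        plt.xlabel("epoch")
        plt.ylabel(metric)
        plt.legend()
    plt.show()

# plot_history(model.fit(x, y, validation_split=0.1, epochs=20))
```

Roughly, the accuracy curve shows the fraction of correct predictions per epoch and the loss curve shows the objective being minimized; diverging training and validation curves usually indicate overfitting.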

How to best transfer learning using Dopamine for Reinforcement Learning?

I am using Google's Dopamine framework to train a specific reinforcement learning use-case. I am using an autoencoder to pre-train the convolutional layers of the Deep Q Network and then transfer those pre-trained weights to the final network.
To that end, I have created a separate model (in this case an auto-encoder) which I train and save the resulting model and weights.
The DQN model is created using Keras's model sub-classing method, and the model used to save the trained convolutional layers' weights was built using the Sequential API. My issue arises when trying to load the pre-trained weights into my final DQN model. Based on whether I use the load_model() or load_weights() functionality from TensorFlow's API, I get two different overall behaviors of my network, and I would like to understand why. Specifically, I have the two following scenarios (a condensed sketch follows the list):
1. Loading the weights with the load_weights() method into the final model. The weights are the weights of the encoder plus one additional layer (added just before saving the weights) to fit the architecture of the final network implemented in Dopamine, where they are loaded.
2. First load the saved model with load_model() and then, when defining the new model in the __init__() method, extract the relevant layers from the loaded model and use them for the final model.
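A condensed sketch of the two loading paths (the file name and the architecture are hypothetical, not the actual Dopamine code):

```python
from tensorflow import keras

# Illustrative convolutional stack assumed to match the saved encoder.
def build_conv_stack():
    return keras.Sequential([
        keras.layers.Conv2D(32, 8, strides=4, activation="relu",
                            input_shape=(84, 84, 4)),
        keras.layers.Conv2D(64, 4, strides=2, activation="relu"),
        keras.layers.Conv2D(64, 3, strides=1, activation="relu"),
    ])

# Scenario 1: load only the weights into an identically shaped stack.
conv_stack = build_conv_stack()
conv_stack.load_weights("pretrained_encoder.h5")  # hypothetical path

# Scenario 2: load the whole saved model and pull the relevant layers out of it.
pretrained = keras.models.load_model("pretrained_encoder.h5")  # hypothetical path
reused_layers = pretrained.layers[:-1]  # e.g. drop the layer added before saving
```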
Overall, I would expect the two approaches to yield similar results with regard to the average reward achieved per episode when I use the same pre-trained weights. However, the two approaches differ (1. yields a higher average reward than 2., although using the same pre-trained weights) and I don't understand why.
Furthermore, in order to validate this behavior, I tried loading random weights with the two aforementioned approaches in order to see a change in behavior. In both cases, based on which of the two loading methods I am using, I end up with behavior very similar to the respective case when loading the trained weights. It seems like the pre-trained weights in each respective case have no effect on the overall resulting training behavior. Although this might be irrelevant to the issue I am trying to investigate here, as it might be the case that the pre-trained weights don't offer any benefit overall, which is also possible.
Any thoughts and ideas on this would be much appreciated.

Why does the Gramian Matrix work for VGG16 but not for EfficientNet or MobileNet?

A Neural Algorithm of Artistic Style uses the Gramian Matrix of the intermediate feature vectors of the VGG16 classification network trained on ImageNet. Back then, that was probably a good choice because VGG16 was one of the best-performing classification networks. Nowadays, there are much more efficient classification networks that surpass VGG in classification performance while requiring fewer parameters and FLOPS, for example EfficientNet and MobileNetv2.
But when I tried this out in practice, the Gramian Matrix of VGG16 features appears representative of the image style, in that its L2 distance for stylistically similar images is smaller than the L2 distance to stylistically unrelated images. For the Gramian Matrix calculated from EfficientNet and MobileNetv2 features, that does not appear to be the case: the L2 distance between very similar images and between very dissimilar images only varies by about 5%.
From the network structure, VGG, EfficientNet, and MobileNet all have convolutions with batch normalization and ReLU in between, so the building blocks are the same. Then which design decision is unique to VGG so that its Gramian Matrix captures the style, while EfficientNet's and MobileNet's do not?
I have since figured it out: the Gramian Matrix needs partially correlated features to work correctly. Newer networks are trained with a Dropout regularizer, which reduces the inter-feature correlation.
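For reference, the Gramian Matrix of a layer's feature maps is typically computed like this (a minimal sketch; the chosen VGG16 layer and the random placeholder input are only illustrative):

```python
import tensorflow as tf
from tensorflow import keras

def gram_matrix(features):
    """features: a (batch, height, width, channels) activation tensor."""
    # Channel-by-channel correlations, averaged over spatial positions.
    gram = tf.einsum("bijc,bijd->bcd", features, features)
    num_positions = tf.cast(tf.shape(features)[1] * tf.shape(features)[2], tf.float32)
    return gram / num_positions

vgg = keras.applications.VGG16(include_top=False, weights="imagenet")
feature_model = keras.Model(vgg.input, vgg.get_layer("block3_conv3").output)

image = tf.random.uniform((1, 224, 224, 3)) * 255.0  # placeholder image
features = feature_model(keras.applications.vgg16.preprocess_input(image))
style = gram_matrix(features)  # compare styles via the L2 distance between these
```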

Can I train Keras/TF model layer by layer?

I am looking to train a large face identification network (ResNet or VGG-16/19) with TensorFlow 1.14.
My question is: if I run out of GPU memory, is it a valid strategy to train sets of layers one by one?
For example, train two convolutional layers and a max-pooling layer as one set, then "freeze the weights" somehow and train the next set, etc.
I know I can train on multiple GPUs in TensorFlow, but what if I want to stick to just one GPU?
The usual approach is to use transfer learning: use a pretrained model and fine-tune it for the task.
For fine-tuning in computer vision, a known approach is re-training only the last couple of layers. See for example:
https://www.learnopencv.com/keras-tutorial-fine-tuning-using-pre-trained-models/
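A minimal Keras sketch of that recipe, freezing a pretrained backbone and training only a new head (the backbone choice, layer sizes, and class count are illustrative):

```python
from tensorflow import keras

num_identities = 1000  # hypothetical number of face identities

# Pretrained backbone without its classification top.
base = keras.applications.VGG16(include_top=False, weights="imagenet",
                                input_shape=(224, 224, 3), pooling="avg")
base.trainable = False  # freeze all convolutional layers; only the head trains

model = keras.Sequential([
    base,
    keras.layers.Dense(256, activation="relu"),
    keras.layers.Dense(num_identities, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# To fine-tune, unfreeze the last few layers and recompile before training again:
# for layer in base.layers[-4:]:
#     layer.trainable = True
# model.compile(optimizer=keras.optimizers.Adam(1e-5),
#               loss="sparse_categorical_crossentropy")
```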
I may be wrong, but even if you freeze your weights, they still need to be loaded into memory (you need to do the whole forward pass in order to compute the loss).
Comments on this are appreciated.