How to understand LSTM performance from a graph? - matplotlib

I used an LSTM, implemented in Keras, for the task of Named Entity Recognition. The model reaches around 95% accuracy. I then created graphs of the accuracy and the loss. What I don't understand is: what exactly do these two graphs show?
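For reference, these curves are usually drawn straight from the History object returned by model.fit(). Below is a minimal matplotlib sketch, assuming the model was compiled with metrics=['accuracy'] and fitted with validation data; older Keras versions use the keys 'acc'/'val_acc' instead of 'accuracy'/'val_accuracy'.

    import matplotlib.pyplot as plt

    def plot_history(history):
        """Plot training/validation accuracy and loss per epoch."""
        fig, (ax_acc, ax_loss) = plt.subplots(1, 2, figsize=(10, 4))

        # Accuracy: fraction of tokens whose predicted tag matches the true tag.
        ax_acc.plot(history.history['accuracy'], label='train')
        ax_acc.plot(history.history['val_accuracy'], label='validation')
        ax_acc.set_xlabel('epoch')
        ax_acc.set_ylabel('accuracy')
        ax_acc.legend()

        # Loss: the quantity the optimizer actually minimizes during training.
        ax_loss.plot(history.history['loss'], label='train')
        ax_loss.plot(history.history['val_loss'], label='validation')
        ax_loss.set_xlabel('epoch')
        ax_loss.set_ylabel('loss')
        ax_loss.legend()

        plt.tight_layout()
        plt.show()

Reading the two panels together is what matters: training loss falling while validation loss rises is the usual sign of overfitting, and for NER a high token-level accuracy can be misleading because most tokens carry the "O" (non-entity) tag.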

Related

Using multiple losses and multiple training steps in a TF2 model using subclassing?

I am implementing a generative adversarial autoencoder in TF2. I have got it working, but not optimally, and could use some high-level advice on how to improve it.
The model is inspired by the paper “Adversarial Factorization Autoencoder for Look-alike Modeling” https://dmkd.cs.vt.edu/papers/CIKM19.pdf
The model consists of three parts: an encoder/generator, a decoder and a discriminator.
I have implemented these three parts as custom classes, each subclassing tf.keras.Model (which is the primary source of my issues). I have three different loss functions (autoencoder loss, generator loss, discriminator loss) and two custom training-step functions.
The first training function first trains the autoencoder (using the autoencoder loss function) and then the generator (using the generator loss function). The generator is just the encoder part of the autoencoder, but it has the additional job of fooling the discriminator.
The second training function trains the discriminator using the discriminator loss function.
This approach works OK, but subclassing all three parts from tf.keras.Model has limitations: I can't use the Keras compile() and fit() functionality, and callbacks are a nightmare, even though I really need early stopping, best-model checkpointing, TensorBoard integration and so on. It appears the best approach would be to subclass each part from tf.keras.layers.Layer and then combine them in a single custom model. But I am not sure whether it is possible at all to wire up multiple loss functions and multiple training steps to different layer blocks of a custom model.
Any hints and insights are greatly appreciated.
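One workable route is to keep the three parts as sub-models, wrap them in a single tf.keras.Model, and override train_step() so that each loss updates only its own variables; compile()/fit() and the standard callbacks (EarlyStopping, ModelCheckpoint, TensorBoard) then operate on whatever metrics train_step returns. A rough sketch under those assumptions, with mean-squared error and binary cross-entropy standing in for the paper's actual losses and fit() assumed to be called with inputs only:

    import tensorflow as tf

    class AdversarialAE(tf.keras.Model):
        def __init__(self, encoder, decoder, discriminator):
            super().__init__()
            self.encoder = encoder          # existing sub-models / layer blocks
            self.decoder = decoder
            self.discriminator = discriminator

        def compile(self, ae_opt, gen_opt, disc_opt, **kwargs):
            super().compile(**kwargs)
            self.ae_opt, self.gen_opt, self.disc_opt = ae_opt, gen_opt, disc_opt
            self.bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

        def train_step(self, data):
            x = data  # assuming fit(x) with no labels

            # 1) Autoencoder step: reconstruction loss through encoder + decoder.
            with tf.GradientTape() as tape:
                z = self.encoder(x, training=True)
                x_hat = self.decoder(z, training=True)
                ae_loss = tf.reduce_mean(tf.square(x - x_hat))
            ae_vars = self.encoder.trainable_variables + self.decoder.trainable_variables
            self.ae_opt.apply_gradients(zip(tape.gradient(ae_loss, ae_vars), ae_vars))

            # 2) Discriminator step: prior samples vs. encoder outputs.
            with tf.GradientTape() as tape:
                z_fake = self.encoder(x, training=True)
                z_real = tf.random.normal(tf.shape(z_fake))
                d_real = self.discriminator(z_real, training=True)
                d_fake = self.discriminator(z_fake, training=True)
                disc_loss = (self.bce(tf.ones_like(d_real), d_real) +
                             self.bce(tf.zeros_like(d_fake), d_fake))
            d_vars = self.discriminator.trainable_variables
            self.disc_opt.apply_gradients(zip(tape.gradient(disc_loss, d_vars), d_vars))

            # 3) Generator step: the encoder tries to fool the discriminator.
            with tf.GradientTape() as tape:
                d_fake = self.discriminator(self.encoder(x, training=True), training=True)
                gen_loss = self.bce(tf.ones_like(d_fake), d_fake)
            g_vars = self.encoder.trainable_variables
            self.gen_opt.apply_gradients(zip(tape.gradient(gen_loss, g_vars), g_vars))

            return {'ae_loss': ae_loss, 'disc_loss': disc_loss, 'gen_loss': gen_loss}

Because the wrapper is the model you compile and fit, a callback can simply monitor one of the returned keys, for example EarlyStopping(monitor='ae_loss').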

How best to do transfer learning using Dopamine for Reinforcement Learning?

I am using Google's Dopamine framework to train a specific reinforcement learning use case. I am using an autoencoder to pre-train the convolutional layers of the Deep Q Network and then transfer those pre-trained weights to the final network.
To that end, I have created a separate model (in this case an autoencoder), which I train, and I save the resulting model and weights.
The DQN model is created using Keras's model subclassing method, and the model used to save the trained convolutional layers' weights was built using the Sequential API. My issue arises when trying to load the pre-trained weights into my final DQN model. Depending on whether I use the load_model() or the load_weights() functionality from TensorFlow's API, I get two different overall behaviors of my network, and I would like to understand why. Specifically, I have the following two scenarios (roughly sketched below):
Loading the weights into the final model with the load_weights() method. The weights are the weights of the encoder plus one additional layer (added just before saving the weights) to fit the architecture of the final network implemented in Dopamine, where they are loaded.
First loading the saved model with load_model(), and then, when defining the new model in the __init__() method, extracting the relevant layers from the loaded model and using them in the final model.
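In code the two scenarios look roughly like this; the file names, the helper that rebuilds the layers, and the layer slice are all placeholders for whatever was actually saved:

    import tensorflow as tf

    # Scenario 1: rebuild the layers yourself, then copy the saved weights in.
    dqn_encoder = build_encoder_layers()                        # hypothetical helper matching the saved layout
    dqn_encoder.load_weights('pretrained_encoder_weights.h5')   # weights are copied layer by layer

    # Scenario 2: load the whole saved model and reuse its layer objects
    # inside the DQN's __init__(); the reused layers keep their trained variables.
    saved_ae = tf.keras.models.load_model('pretrained_encoder.h5')
    conv_layers = saved_ae.layers[:3]   # placeholder slice for the convolutional block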
Overall, I would expect the two approaches to yield similar results with regard to the average reward achieved per episode when I use the same pre-trained weights. However, the two approaches differ (1. yields a higher average reward than 2., although they use the same pre-trained weights) and I don't understand why.
Furthermore, to validate this behavior, I tried loading random weights with the two aforementioned approaches in order to see whether the behavior changes. In both cases, depending on which of the two loading methods I use, I end up with behavior very similar to the respective case with the trained weights. It seems as though the pre-trained weights in each respective case have no effect on the overall training behavior. This might be irrelevant to the issue I am trying to investigate here, since it may simply be that the pre-trained weights don't offer any benefit at all, which is also possible.
Any thoughts and ideas on this would be much appreciated.

Predicting using a pre-trained model in tf.keras

What is the difference between rescaling and not rescaling images for prediction when using a tf.keras ResNet50 pre-trained on ImageNet?
Is it necessary? How much of an impact does it have on the predictions?
It is the difference between the model working as expected and not working at all. If you do not apply the same normalization that was applied to the training set, the model usually behaves strangely, for example always producing the same output, which is not what you want.
So always use the exact same scaling and normalization that were used to train the model.
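For a tf.keras ResNet50 specifically, "the exact same scaling" means its own preprocess_input (Caffe-style channel-mean subtraction with BGR ordering), not a plain division by 255. A minimal sketch, with the image path as a placeholder:

    import numpy as np
    import tensorflow as tf

    model = tf.keras.applications.ResNet50(weights='imagenet')

    # Load and batch a single image (placeholder file name).
    img = tf.keras.preprocessing.image.load_img('example.jpg', target_size=(224, 224))
    x = tf.keras.preprocessing.image.img_to_array(img)[np.newaxis, ...]

    # Apply the same normalization the ImageNet weights were trained with.
    x = tf.keras.applications.resnet50.preprocess_input(x)

    preds = model.predict(x)
    print(tf.keras.applications.resnet50.decode_predictions(preds, top=3))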

Should TensorFlow model saving be approached differently during training vs. deployment?

Assume that I have a CNN which I am training on some dataset. The most important part of the model is the CNN architecture.
Now, when I write the code, I define the model structure in a Python class. However, outside that class I define a number of other nodes, such as the loss, the accuracy, a tf.Variable to keep count of epochs, and so on.
During training, in order to be able to resume properly, I'd like to save all of these nodes (e.g. the loss, the epoch variable, etc.), not just the CNN structure.
However, once training is done, I would like to save only the CNN architecture, without the nodes for the loss, accuracy, etc. This is because it gives people using my model the freedom to write their own fine-tuning code.
How can I achieve this in TF code? Can someone show an example?
Is this the approach to saving that others follow as well? I just want to know whether my approach is right.
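One common split, sketched here with the TF2-style tf.train.Checkpoint API and a throwaway placeholder CNN (the question is phrased in TF1 graph terms, so treat this as the idea rather than a literal translation), is to checkpoint everything while training and export only the model at the end:

    import tensorflow as tf

    # Placeholder CNN; the real architecture is whatever your class defines.
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, 3, activation='relu', input_shape=(28, 28, 1)),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10),
    ])
    optimizer = tf.keras.optimizers.Adam()
    epoch = tf.Variable(0, dtype=tf.int64)

    # During training: checkpoint everything needed to resume cleanly
    # (weights, optimizer state, the epoch counter).
    ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer, epoch=epoch)
    manager = tf.train.CheckpointManager(ckpt, './train_ckpts', max_to_keep=3)
    manager.save()
    # ...later: ckpt.restore(manager.latest_checkpoint)

    # After training: export only the architecture and weights, with no loss,
    # optimizer or epoch bookkeeping attached, so others can fine-tune freely.
    model.save('exported_model')            # SavedModel directory
    model.save_weights('final_weights.h5')  # or just the weights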

How to minimize two losses using TensorFlow?

I am working on a project whose goal is to localize an object in an image. The method I am going to adopt is based on the localization algorithm in CS231n-8.
The network structure has two optimization heads, a classification head and a regression head. How can I minimize both of them when training the network?
One idea I have is to sum them into a single loss. The problem is that the classification loss is a softmax loss and the regression loss is an L2 loss, so they have different ranges. I don't think this is the best way.
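For what it's worth, the weighted-sum idea can be made workable by scaling one of the terms; the weight below (alpha) is a hypothetical hyperparameter you would have to tune, not something prescribed by the course:

    import tensorflow as tf

    def total_loss(cls_logits, cls_labels, box_pred, box_true, alpha=1.0):
        # Softmax cross-entropy for the classification head.
        cls_loss = tf.reduce_mean(
            tf.nn.sparse_softmax_cross_entropy_with_logits(
                labels=cls_labels, logits=cls_logits))
        # L2 loss for the regression head (e.g. box coordinates).
        reg_loss = tf.reduce_mean(
            tf.reduce_sum(tf.square(box_pred - box_true), axis=-1))
        # One scalar for the optimizer; gradients flow into both heads.
        return cls_loss + alpha * reg_loss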
It depends on the state of your network.
If your network is already able to extract features (you're using weights kept from some other net), you can turn those weights into constants and then train the two heads separately, since the gradient will not flow through the constants.
If you're not using weights from a pre-trained model, you have to:
Train the network to extract features: train it using the classification head and let the gradient flow from the classification head down to the first convolutional filter. In this way your network can now classify objects by combining the extracted features.
Convert the learned weights of the convolutional filters and of the classification head to constant tensors, and then train the regression head.
The regression head will learn to combine the features extracted from the convolutional layers, adapting its parameters in order to minimize the L2 loss.
Tl;dr:
Train the network for classification first.
Convert every learned parameter to a constant tensor, using graph_util.convert_variables_to_constants as shown in the freeze_graph script.
Train the regression head.
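An equivalent way to get the same staged behaviour without literally freezing the graph, sketched with Keras APIs and placeholder shapes and names, is to build two models over a shared backbone and mark the backbone as non-trainable for the second stage:

    import tensorflow as tf

    inputs = tf.keras.Input(shape=(64, 64, 3))            # placeholder input size
    backbone = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation='relu'),
        tf.keras.layers.GlobalAveragePooling2D(),
    ], name='backbone')
    features = backbone(inputs)

    cls_out = tf.keras.layers.Dense(10, name='cls_head')(features)  # placeholder class count
    box_out = tf.keras.layers.Dense(4, name='reg_head')(features)   # x, y, w, h

    # Stage 1: train the backbone through the classification head only.
    cls_model = tf.keras.Model(inputs, cls_out)
    cls_model.compile(optimizer='adam',
                      loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
    # cls_model.fit(x_train, y_cls, ...)

    # Stage 2: freeze what was learned, then train only the regression head;
    # no gradient reaches the frozen backbone.
    backbone.trainable = False
    reg_model = tf.keras.Model(inputs, box_out)
    reg_model.compile(optimizer='adam', loss='mse')
    # reg_model.fit(x_train, y_box, ...)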