In a reinforcement learning setting, I do not have the whole dataset during training.
I take only one timestep as input.
With the option stateful=True, does the gradient flow between batches in backpropagation?
If this is not the case, how would one implement this in TensorFlow?
I'm struggling to understand how a Keras model works.
When we train a model, we pass metrics (like ['accuracy']) and a loss function (like cross-entropy) as arguments.
What I want to know is which one the model actually optimizes.
After fitting, does the learned model maximize accuracy, or minimize the loss?
The model optimizes the loss; metrics are only there for your information and for reporting results.
https://en.wikipedia.org/wiki/Loss_function
Note that metrics are optional, but you must provide a loss function to do training.
You can also evaluate a model on metrics that were not added during training.
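As a quick illustration, here is a minimal sketch (layer sizes and loss/metric choices are made up for the example): only the loss passed to compile() is minimized by training, while the metrics are just computed and reported.

```python
import tensorflow as tf

# Illustrative model; the shapes and layers are arbitrary.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])

model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",  # this is what gradient descent minimizes
    metrics=["accuracy"],                    # only computed and reported, never optimized
)

# Metrics are optional; you could also recompile later with additional metrics
# and call model.evaluate(...) to score a trained model on metrics that were
# not tracked during training.
```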
Keras models work by minimizing the loss, adjusting the trainable model parameters via backpropagation. Metrics such as training accuracy, validation accuracy, etc. are provided as information, but they can also be used to improve your model's performance through the use of Keras callbacks. Documentation for that is located here. For example, the callback ReduceLROnPlateau (documentation is here) can be used to monitor a metric like validation loss and reduce the model's learning rate if the loss fails to decrease after a certain number (the patience parameter) of consecutive epochs.
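A minimal sketch of how ReduceLROnPlateau might be wired in (the monitored quantity and hyperparameter values are illustrative, not prescriptive):

```python
from tensorflow.keras.callbacks import ReduceLROnPlateau

reduce_lr = ReduceLROnPlateau(
    monitor="val_loss",  # metric to watch
    factor=0.5,          # multiply the learning rate by this factor when triggered
    patience=3,          # epochs without improvement before reducing
    min_lr=1e-6,         # lower bound on the learning rate
)

# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=50, callbacks=[reduce_lr])
```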
I have created my own LSTM neural network in VB.NET. From what I've read, LSTM networks are not supposed to suffer from exploding/vanishing gradients. However, after a while all the gradients increase to the maximum, and changing the learning rate only affects how long this takes to happen. Is there anything that can cause exploding gradients in an LSTM network?
I'm using RMSProp with momentum to update the weights, with sequence lengths ranging from 32 to 64. The network also includes peephole connections, and the training data is in the range [0, 1].
I based it on the paper LSTM: A Search Space Odyssey.
I had the same problem with an LSTM in PyTorch. Clipping the gradients helped.
Additionally, you could try changing the learning rate.
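A minimal sketch of what gradient clipping looks like in a PyTorch training step (the model, optimizer, and clipping threshold are placeholders to be tuned for your setup):

```python
import torch

MAX_GRAD_NORM = 1.0  # threshold chosen by experiment

def training_step(model, optimizer, loss_fn, x, y):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    # Rescale the gradients so their global norm does not exceed MAX_GRAD_NORM.
    torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)
    optimizer.step()
    return loss.item()
```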
I am training my deep neural network in Keras (TF backend). I want to print the first loss during training, to check that my initialization is correct, so I need the initial loss calculated by the DNN after the first forward pass.
Keras callbacks let me get the loss after every epoch, but I want it after the first training step.
You can create a custom callback using on_batch_end:
https://keras.io/callbacks/
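A minimal sketch of such a callback (written against tf.keras; the class name is just illustrative), printing the loss once, right after the first training batch:

```python
import tensorflow as tf

class FirstBatchLoss(tf.keras.callbacks.Callback):
    """Prints the loss reported after the very first training step."""

    def __init__(self):
        super().__init__()
        self.printed = False

    def on_batch_end(self, batch, logs=None):
        if not self.printed:
            print("Loss after first training step:", logs.get("loss"))
            self.printed = True

# model.fit(x_train, y_train, callbacks=[FirstBatchLoss()])
```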
According to the Keras documentation, dropout layers behave differently in the training and test phases:
Note that if your model has a different behavior in training and
testing phase (e.g. if it uses Dropout, BatchNormalization, etc.), you
will need to pass the learning phase flag to your function:
Unfortunately, nobody talks about what the actual differences are. Why should dropout behave differently in the test phase? I expect the layer to set a certain fraction of neurons to 0. Why should this behavior depend on the training/test phase?
Dropout is used in the training phase to reduce the chance of overfitting. As you mention, this layer deactivates certain neurons, which makes the model less dependent on the weights of any individual node. Essentially, with the dropout layer the trained model is an average of many thinned models. Check a more detailed explanation here.
However, when you apply your trained model you want to use its full power: you want all neurons in the trained (averaged) network active to get the highest accuracy.
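To make the difference concrete, here is a minimal sketch using the tf.keras API (the training= argument plays the role of the learning phase flag mentioned in the docs quote; the input values are arbitrary):

```python
import numpy as np
import tensorflow as tf

drop = tf.keras.layers.Dropout(rate=0.5)
x = np.ones((1, 10), dtype="float32")

print(drop(x, training=True))   # training: roughly half the units are zeroed, the rest are scaled up
print(drop(x, training=False))  # inference: all units pass through unchanged
```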
I am working on a project to localize objects in an image. The method I am going to adopt is based on the localization algorithm in CS231n Lecture 8.
The network has two optimization heads, a classification head and a regression head. How can I minimize both of them when training the network?
One idea is to combine them into a single loss, but the problem is that the classification loss is a softmax (cross-entropy) loss while the regression loss is an L2 loss, so they have different ranges. I don't think this is the best way.
It depends on the state of your network.
If your network is already able to extract features (you're using weights kept from some other net), you can set those weights to be constants and then train the two heads separately, since the gradient will not flow through the constants.
If you're not using weights from a pre-trained model, you have to:
1. Train the network to extract features: train it using the classification head and let the gradient flow from the classification head down to the first convolutional filter. In this way the network learns to classify objects by combining the extracted features.
2. Convert the learned weights of the convolutional filters and of the classification head to constant tensors, then train the regression head. The regression head will learn to combine the features extracted by the convolutional layers, adapting its parameters in order to minimize the L2 loss.
Tl;dr:
Train the network for classification first.
Convert every learned parameter to a constant tensor, using graph_util.convert_variables_to_constants as shown in the `freeze_graph` script.
Train the regression head.
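If you're working with Keras rather than the TF1 freeze_graph tooling, a rough equivalent of this recipe is to freeze the shared layers with trainable=False between the two stages instead of literally converting them to constant tensors. The sketch below uses made-up shapes, layer names, and loss choices:

```python
import tensorflow as tf

# Shared feature extractor (illustrative architecture).
inputs = tf.keras.Input(shape=(224, 224, 3))
x = tf.keras.layers.Conv2D(32, 3, activation="relu")(inputs)
x = tf.keras.layers.GlobalAveragePooling2D()(x)

cls_head = tf.keras.layers.Dense(10, activation="softmax", name="cls")(x)
reg_head = tf.keras.layers.Dense(4, name="box")(x)  # e.g. box coordinates

# Stage 1: train the feature extractor together with the classification head.
cls_model = tf.keras.Model(inputs, cls_head)
cls_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# cls_model.fit(images, class_labels, ...)

# Stage 2: freeze everything learned so far, then train only the regression head.
for layer in cls_model.layers:
    layer.trainable = False
reg_model = tf.keras.Model(inputs, reg_head)
reg_model.compile(optimizer="adam", loss="mse")  # L2-style regression loss
# reg_model.fit(images, box_targets, ...)
```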