Tensorboard: Why is there a zigzag pattern at gradient plots? - tensorflow

Here is a picture of the gradient of a conv2d layer (the kernel). It has a zigzag pattern that I would like to understand. I understand that the gradient changes from mini-batch to mini-batch, but why does it increase after each epoch?
I am using the Keras Adam optimizer with default settings, and I don't think that is the reason. Dropout and batch normalization should also not be the cause. I am using image augmentation, but that does not change the behavior from batch to batch.
Does anybody have an idea?

I've seen this before with Keras metrics.
In that case the problem was that the metrics maintain a running average across each epoch, and it's that "average so far" that they report to TensorBoard.
How are these gradients getting to TensorBoard? Are you passing them to a tf.keras.metrics.Mean? If so, you probably want to call reset_states on it, maybe in a custom callback's on_batch_end.
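If that's what's happening, a minimal sketch of the fix could look like this (the grad_mean metric here is a hypothetical stand-in for however you are accumulating the gradients; adapt it to your own logging code):

```python
import tensorflow as tf

# Hypothetical running-average metric used to log gradient magnitudes to TensorBoard.
grad_mean = tf.keras.metrics.Mean(name="conv2d_kernel_grad")

class ResetGradMetric(tf.keras.callbacks.Callback):
    """Clear the running average after every batch so TensorBoard sees
    per-batch gradient values instead of an epoch-long running mean."""
    def on_batch_end(self, batch, logs=None):
        grad_mean.reset_states()
```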

Related

RNN/GRU Increasing validation loss but decreasing mean absolute error

I am new to deep learning and am trying to implement an RNN (with 2 GRU layers).
At first, the network seems to do its job quite well. However, I am currently trying to understand the loss and accuracy curves. I attached the pictures below. The dark-blue line is the training set and the cyan line is the validation set.
After 50 epochs the validation loss increases. My assumption is that this indicates overfitting. However, I am unsure why the validation mean absolute error still decreases. Do you have any idea?
One idea I had was that this could be caused by some big outliers in my dataset, so I have already tried to clean it up and to scale it properly. I also added a few dropout layers for further regularization (rate=0.2). However, these are just normal dropout layers because cuDNN does not seem to support recurrent_dropout in TensorFlow.
Remark: I am using the negative log-likelihood as the loss function and a TensorFlow Probability distribution as the output layer.
Any hints on what I should investigate?
Thanks in advance
Edit: I also attached the non-probabilistic plot as recommended in the comment. It seems that here the mean absolute error behaves normally (it does not improve all the time).
What are the outputs of your model? It sounds pretty strange that you're using the negative log-likelihood (which basically "works" with distributions) as the loss function but MAE as a metric, which is suited for deterministic continuous values.
I don't know what your task is, and perhaps this combination is meaningful in your specific case, but the strange behavior may well come from there.
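For reference, the kind of setup being described usually looks roughly like this (a sketch, assuming a tfp.layers.IndependentNormal output head; the actual architecture and shapes may differ):

```python
import tensorflow as tf
import tensorflow_probability as tfp

# The model outputs a distribution, so the natural loss is the
# negative log-likelihood of the observed targets under that distribution.
negloglik = lambda y_true, dist: -dist.log_prob(y_true)

model = tf.keras.Sequential([
    tf.keras.layers.GRU(64, return_sequences=True),
    tf.keras.layers.GRU(64),
    tf.keras.layers.Dense(tfp.layers.IndependentNormal.params_size(1)),
    tfp.layers.IndependentNormal(1),
])

# MAE is computed on the distribution after it is coerced to a tensor
# (by default a sample), which is why it can move differently from the NLL.
model.compile(optimizer="adam", loss=negloglik, metrics=["mae"])
```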

What Is Regularisation Loss in TensorFlow API? It Doesn't Align With Any Other Loss Function

I'm training an EfficientDet V7 model using the V2 model zoo and have the following output in TensorBoard:
This is great, you can see that my classification and localisation losses are dropping to low levels (I'll worry about overfitting later if it turns out to be a separate issue) - but the regularisation loss is still high, and this is keeping my total loss at quite high levels. I can't seem to a) find a clear explanation (for a newbie) of what the regularisation loss represents in this context, or b) find suggestions as to why it might be so high.
Usually, regularization loss is something like an L2 loss computed on the weights of your neural net. Minimization of this loss tends to shrink the values of the weights.
It is a regularization (hence the name) technique, which can help with such problems as over-fitting (maybe this article can help if you want to know more).
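As a rough illustration (a sketch of the idea, not the exact computation the Object Detection API performs), an L2 regularization loss boils down to something like:

```python
import tensorflow as tf

def l2_regularization_loss(model, weight_decay=1e-4):
    # Sum of squared kernel weights, scaled by the regularization strength.
    # Minimizing this term pushes the weights towards smaller values.
    return weight_decay * tf.add_n(
        [tf.nn.l2_loss(w) for w in model.trainable_weights if "kernel" in w.name])
```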
Bottom line: You don't have to do anything about it.

Does Keras prioritize metrics or loss?

I'm struggling to understand how a Keras model works.
When we train a model, we give metrics (like ['accuracy']) and a loss function (like cross-entropy) as arguments.
What I want to know is which one is the goal the model optimizes.
After fitting, does the learned model maximize accuracy, or minimize loss?
The model optimizes the loss; metrics are only there for your information and for reporting results.
https://en.wikipedia.org/wiki/Loss_function
Note that metrics are optional, but you must provide a loss function to do training.
You can also evaluate a model on metrics that were not added during training.
Keras models work by minimizing the loss, adjusting the trainable model parameters via back-propagation. Metrics such as training accuracy and validation accuracy are provided as information, but they can also be used to improve your model's performance through the use of Keras callbacks. Documentation for that is located here. For example, the callback ReduceLROnPlateau (documentation is here) can be used to monitor a metric like validation loss and adjust the model's learning rate if the monitored quantity fails to improve after a certain number (the patience parameter) of consecutive epochs.
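For example, a minimal, runnable sketch of that pattern (toy model and data; the callback settings are illustrative):

```python
import numpy as np
import tensorflow as tf

# Toy model and data just to make the example runnable.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",  # what training minimizes
              metrics=["accuracy"])                     # reported, not optimized

x = np.random.rand(200, 8).astype("float32")
y = np.random.randint(0, 3, size=(200,))

# A monitored quantity (here validation loss) adjusts training via a callback.
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.5, patience=3)

model.fit(x, y, validation_split=0.2, epochs=10, callbacks=[reduce_lr])
```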

Why is L2 regularization not added back into original loss function?

I'm aware that when using a kernel regularizer, particularly L2 loss, I should add it back into the loss function, and this is what is done in other posts. However, in Keras this is not being done. Why is this so?
For instance, consider this and this notebook. They are using L2 loss as a kernel regularizer in some layers but are not adding it back into the original loss. Is this because of the particular loss, is this a behavior followed just in Keras, or am I completely misunderstanding everything?
Keras hides a lot of complexity (and this is not always a good thing).
You're using the Model abstraction: this model contains inside all the required information about the architecture and the training procedure.
When you invoke compile (and later fit or train_on_batch) you specify the loss function, but under the hood what happens is:
Keras instantiates the specified loss function (e.g. categorical cross-entropy)
It fetches the regularizations applied in the model and adds all of them to the loss term previously instantiated
You can see the operations that are going to be added to the loss term by accessing the .losses property of the model instance (that's a list of TensorFlow operations, usually all multiplication operations, since the regularizations have the form regularization_strength * norm_p(variable)).
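You can verify this yourself with a minimal sketch:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(4,
                          kernel_regularizer=tf.keras.regularizers.l2(0.01),
                          input_shape=(3,)),
])

# The regularization penalties Keras has collected; these are added to the
# compiled loss automatically during fit, without you doing anything extra.
print(model.losses)
```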
The L2 regularization (or any weight regularization) in Keras is still added to the loss function in the same way you would expect. It just happens behind the scenes, so the user doesn't need to worry about it.
The notebooks you linked are the right way to use weight regularization in Keras.

Sequence Labeling in TensorFlow

I have managed to train a word2vec model with TensorFlow, and I want to feed those results into an RNN with LSTM cells for sequence labeling.
1) It's not really clear how to use a trained word2vec model with an RNN. (How do I feed in the result?)
2) I can't find much documentation on how to implement sequence labeling with an LSTM. (How do I bring in my labels?)
Could someone point me in the right direction on how to start with this task?
I suggest you start by reading the RNN tutorial and sequence-to-sequence tutorial. They explain how to build LSTMs in TensorFlow. Once you're comfortable with that, you'll have to find the right embedding Variable and assign it using your pre-trained word2vec model.
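A common way to do that assignment is sketched below in TF1-style code; `pretrained` here stands in for your word2vec matrix of shape [vocab_size, embedding_dim]:

```python
import numpy as np
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

vocab_size, embedding_dim = 10000, 128
pretrained = np.random.rand(vocab_size, embedding_dim).astype(np.float32)  # your word2vec vectors

word_ids = tf.placeholder(tf.int32, shape=[None, None])  # [batch, time]
embedding = tf.get_variable("embedding", [vocab_size, embedding_dim], trainable=False)
inputs = tf.nn.embedding_lookup(embedding, word_ids)      # [batch, time, embedding_dim]

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(embedding.assign(pretrained))  # load the pre-trained vectors into the graph
```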
I realize this was posted a while ago, but I found this Gist about sequence labeling and this Gist about variable sequence labeling really helpful for figuring out sequence labeling. The basic outline (the gist of the Gist):
Use dynamic_rnn to handle unrolling your network for training and prediction. This method has moved around some in the API, so you may have to find it for your version, but just Google it.
Arrange your data into batches of size [batch_size, sequence_length, num_features], and your labels into batches of size [batch_size, sequence_length, num_classes]. Note that you want a label for every time step in your sequence.
For variable-length sequences, pass a value to the sequence_length argument of the dynamic_rnn wrapper for each sequence in your batch.
Training the RNN is very similar to training any other neural network once you have the network structure defined: feed it training data and target labels and watch it learn!
And some caveats:
With variable-length sequences, you will need to build masks for calculating your error metrics and stuff. It's all in the second link above, but don't forget when you make your own error metrics! I ran into this a couple of times and it made my networks look like they were doing much worse on variable-length sequences.
You might want to add a regularization term to your loss function. I had some convergence issues without this.
I recommend using tf.train.AdamOptimizer with the default settings at first. Depending on your data, this may not converge and you will need to adjust the settings. This article does a good job of explaining what the different knobs do. Start reading from the beginning; some of the knobs are explained before the Adam section.
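Putting the outline together, here is a minimal TF1-style sketch (shapes, names, and the choice of one-hot labels are illustrative):

```python
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

num_features, num_classes, hidden = 128, 10, 64

inputs = tf.placeholder(tf.float32, [None, None, num_features])   # [batch, time, features]
labels = tf.placeholder(tf.float32, [None, None, num_classes])    # one-hot label per time step
seq_len = tf.placeholder(tf.int32, [None])                        # true length of each sequence

cell = tf.nn.rnn_cell.LSTMCell(hidden)
outputs, _ = tf.nn.dynamic_rnn(cell, inputs, sequence_length=seq_len, dtype=tf.float32)
logits = tf.layers.dense(outputs, num_classes)                     # one prediction per time step

# Mask out padded time steps so they don't contribute to the loss or metrics.
mask = tf.sequence_mask(seq_len, maxlen=tf.shape(inputs)[1], dtype=tf.float32)
per_step_loss = tf.nn.softmax_cross_entropy_with_logits_v2(labels=labels, logits=logits)
loss = tf.reduce_sum(per_step_loss * mask) / tf.reduce_sum(mask)

train_op = tf.train.AdamOptimizer().minimize(loss)
```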
Hopefully these links are helpful to others in the future!