I'm training my own music_vae models and I've noticed that the evaluation loss and accuracy are always 0 when viewed in TensorBoard. This is strange because I follow a similar process to train the RNNs, and with the RNNs the evaluation loss and accuracy look fine.
Here is what I'm seeing in tensorboard:
tensorboard legend
evaluation loss
accuracy
Finally, here is what I'm seeing inside the eval folder. As you can see there is some data there:
Eval folder contents
Any help on this issue would be appreciated! Thanks
Related
When I train a TensorFlow model, it usually prints information similar to the below line at each iteration
INFO:tensorflow:loss = 1.9433185, step = 11 (0.300 sec)
Is the loss being printed the loss of the batch that the model saw currently, or is it the running average loss over all the previous batches of the training?
If I use a batch size of 1 i.e. only one training sample in each batch, then the loss printed will be of every sample separately, or will it be a running average loss?
The loss reported in the progress bar of Keras/TensorFlow is always a running mean of the batches seen so far, it is not a per-batch value.
I do not think there is a way to see the per-batch values during training.
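That said, because the displayed value is a cumulative mean, consecutive displayed averages do determine the per-batch losses arithmetically: after batch n the bar shows avg_n = (l_1 + ... + l_n) / n, so l_n = n * avg_n - (n - 1) * avg_{n-1}. A minimal pure-Python sketch (the helper name is made up):

```python
def per_batch_losses(running_means):
    """Invert a sequence of running means back into per-batch loss values."""
    losses = []
    prev_sum = 0.0
    for n, avg in enumerate(running_means, start=1):
        total = n * avg          # sum of the first n batch losses
        losses.append(total - prev_sum)
        prev_sum = total
    return losses

# Example: per-batch losses 4, 2, 3 show up as running means 4.0, 3.0, 3.0
print(per_batch_losses([4.0, 3.0, 3.0]))  # -> [4.0, 2.0, 3.0]
```

This only recovers values at the precision the progress bar prints them, so it is an approximation in practice.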
I'm training my model via model.fit() in Keras. I stopped the training (either by interrupting it or because it finished), changed the batch_size, and decided to continue training. Here is what's happening:
The loss when the training was stopped/finished = 26
The loss when the training proceeded = 46
Meaning that I lost all the progress I made and it is as if I'm starting over.
It does proceed from where it left off, but only if I don't change anything. If I change the batch size, it is as if the optimizer re-initializes my weights and throws away my progress. How can I get a handle on what the optimizer is doing without my consent?
You most likely have some examples that produce large loss values, and MSE makes this worse. When the batch size is larger, you are probably getting more of these outliers in each batch. You can look at the examples that contribute the most to the loss.
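A quick way to find those top loss-contributing examples is to compute the per-example squared error and sort it; a small NumPy sketch with made-up predictions and targets:

```python
import numpy as np

# Illustrative data: one prediction (index 2) is wildly off
preds   = np.array([2.0, 1.0, 10.0, 3.0])
targets = np.array([2.5, 1.0,  1.0, 3.2])

per_example_mse = (preds - targets) ** 2
# Indices ordered by loss contribution, largest first
worst = np.argsort(per_example_mse)[::-1]
print(worst[:2])  # indices of the two highest-loss examples
```

In a real workflow you would run your model over the dataset first and inspect the inputs at these indices.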
As a learning exercise, I'm training the Inception (v2) model from scratch using the ImageNet dataset from the Kaggle competition. I've heard people say it took them a week or so of training on a GPU to get this model to converge on this same dataset. I'm currently training it on my MacBook Pro (single CPU), so I'm expecting it to take no less than a month or so to converge.
Here's my implementation of the Inception model. Input is 224x224x3 images, with values in range [0, 1].
The learning rate was set to a static 0.01 and I'm using the stochastic gradient descent optimizer.
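For reference, plain SGD with a static learning rate simply moves each weight by the learning rate times its gradient; a tiny pure-Python illustration of one update (the weight and gradient values are arbitrary examples):

```python
# One plain-SGD update with a fixed learning rate, matching the described
# setup (lr = 0.01, no momentum, no decay).
lr = 0.01
weight, grad = 0.5, 2.0
weight = weight - lr * grad
print(weight)  # -> 0.48
```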
My question
After 48 hours of training, the training loss seems to indicate that the model is learning from the training data, but the validation loss is beginning to get worse. Ordinarily, this would suggest the model is overfitting. Does it look like something might be wrong with my model or dataset, or is this expected, given that I've only trained for 5.8 epochs?
My training and validation loss and accuracy after 1.5 epochs.
Training and validation loss and accuracy after 5.8 epochs.
Some input images as seen by the model, as well as the output of one of the early convolution layers.
All,
I started the training process for DeepLab v3+ following this guide. However, after step 1480, I got this error:
Error reported to Coordinator: Nan in summary histogram for: image_pooling/BatchNorm/moving_variance_2
The detailed train log is here
Could someone suggest how to solve this issue? Thanks!
Based on the log, it seems that you are training with batch_size = 1 and fine_tune_batch_norm = True (the default value). Since you are fine-tuning batch norm during training, it is better to set the batch size as large as possible (see the comments in train.py and Q5 in the FAQ). If only limited GPU memory is available, you could fine-tune from the provided pre-trained checkpoint, set a smaller learning rate, and set fine_tune_batch_norm = False (see model_zoo.md for details). Note: make sure the flag tf_initial_checkpoint has the correct path to the desired pre-trained checkpoint.
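Concretely, that advice amounts to an invocation along these lines (a sketch only; the flag names come from deeplab's train.py and all paths are placeholders you must replace):

```shell
# Fine-tune from a pre-trained checkpoint with batch norm frozen
# (fine_tune_batch_norm=false) and a smaller learning rate.
python deeplab/train.py \
  --train_batch_size=1 \
  --fine_tune_batch_norm=false \
  --base_learning_rate=0.0001 \
  --tf_initial_checkpoint=/path/to/pretrained/model.ckpt \
  --train_logdir=/path/to/train_logdir \
  --dataset_dir=/path/to/tfrecords
```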
I am training a Faster R-CNN model on a fruit dataset using a pretrained model provided in the Google API (faster_rcnn_inception_resnet_v2_atrous_coco).
I made a few changes to the default configuration (number of classes: 12, fine_tune_checkpoint: path to the pretrained checkpoint, and from_detection_checkpoint: true). The total number of annotated images I have is around 12,000.
After training for 9000 steps, the accuracy I got is below 1%, though I was expecting it to be at least 50% (in evaluation, nothing is getting detected, as accuracy is almost 0). The loss fluctuates between 0 and 4.
How many steps should I train it for? I read an article that says to run around 800k steps, but is that the number of steps when you train from scratch?
The FC layers of the model are changed because of the different number of classes, but that should not affect the classes that are already present in the pre-trained model, like 'apple', right?
Any help would be much appreciated!
You shouldn't look at your training loss to determine when to stop. Instead, you should run your model through the evaluator periodically, and stop training when the evaluation mAP stops improving.
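That evaluate-and-stop loop can be sketched as a simple plateau check; the function below is a hypothetical helper (the evaluator call itself is omitted, and the patience/threshold values are arbitrary examples):

```python
def should_stop(map_history, patience=3, min_delta=1e-3):
    """Stop when eval mAP has not improved by min_delta for `patience` evals."""
    if len(map_history) <= patience:
        return False
    best_before = max(map_history[:-patience])
    recent_best = max(map_history[-patience:])
    return recent_best < best_before + min_delta

# mAP improves for three evaluations, then plateaus
history = [0.10, 0.25, 0.31, 0.31, 0.30, 0.31]
print(should_stop(history))  # -> True
```

In practice you would append the mAP from each periodic evaluation run to the history and keep the checkpoint with the best value seen so far.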