I'm running TensorFlow on a GPU for training. I have a 1-layer GRU cell with a batch size of 800, and I train for 10 epochs. I see these spikes in the accuracy graph in TensorBoard and I do not understand why. See the image.
If you count the spikes, there are 10, the same as the number of epochs. I tried different configurations, such as reducing the batch size and increasing the number of layers, but the spikes are still there.
You can find the code here if it helps.
I use tf.RandomShuffleQueue for the data with infinite epochs, and I calculate how many steps training should run. I do not think the problem is in how I calculate the accuracy (here). Do you have any suggestions as to why this happens?
EDIT
min_after_dequeue=2000
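Since the spikes line up with the epoch count, it can help to mark where each epoch boundary falls on the step axis and compare that with the TensorBoard plot. A minimal sketch in plain Python: the batch size of 800 and the 10 epochs come from the question, while the dataset size is a made-up value because the post does not state it. With a shuffle queue the boundary is only approximate, since examples from neighbouring epochs may mix inside the queue.

```python
import math

def epoch_boundary_steps(num_examples, batch_size, num_epochs):
    """Return the global step at which each epoch ends."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return [steps_per_epoch * e for e in range(1, num_epochs + 1)]

# num_examples is a hypothetical value used only for illustration.
boundaries = epoch_boundary_steps(num_examples=400_000, batch_size=800, num_epochs=10)
print(boundaries)
```

If the spikes sit at these steps, the cause is likely tied to how data or metric state behaves at the epoch boundary rather than to the model itself.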
Related
I am using the official TensorFlow implementation of ResNet, https://github.com/tensorflow/models/tree/master/official/resnet, to train a binary classifier on my own dataset. I modified the input_fn in imagenet_main.py slightly to do my own image loading and preprocessing. But after many rounds of parameter tuning, I can't make my model train properly. I can only find a set of parameters that lets the training accuracy rise to 100% while the validation accuracy stays around 50% forever. The implementation uses a piecewise learning rate. I tried initial learning rates from 0.1 to 1e-5 and weight decay from 1e-2 to 1e-5, and found no convergence on the validation set.
A suspicious observation is that during training the L2 loss decreases slowly and steadily, while the cross-entropy is very reluctant to decrease, staying around 0.69.
Any ideas about what I can try next?
Regarding my dataset and image preprocessing: the training set is around 100K images and the validation set is around 10K. I just resize each image to 224*224 while keeping the aspect ratio, subtract 127 from each channel, and divide by 255.
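For reference, a cross-entropy of about 0.69 is what a binary classifier produces when it outputs a probability of 0.5 for every example, since -ln(0.5) = ln 2 ≈ 0.693. A quick check in plain Python (this is not taken from the ResNet code; the helper name is made up):

```python
import math

def binary_cross_entropy(label, prob):
    """Cross-entropy for a single example with a 0/1 label."""
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))

# A model that always predicts 0.5 gets the same loss for either label.
chance_loss = binary_cross_entropy(1, 0.5)
print(round(chance_loss, 4))  # 0.6931
```

A loss stuck at this value suggests the classifier output is not moving away from chance level.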
@Hua ResNet has a very large number of trainable parameters and was designed for ImageNet, which has 1k classes, whereas your dataset has only two. The number of parameters is directly related to the risk of overfitting, so the stock ResNet may not be suitable for your data. Try making some changes to reduce the number of parameters. That may help.
I am writing code to train a ResNet on the CIFAR-10 dataset.
The paper Deep Residual Learning For Image Recognition mentions training for around 60,000 epochs.
I was wondering: what exactly does an epoch refer to in this case? Is it a single pass through a minibatch of size 128 (which would mean around 150 passes through the entire 50000-image training set)?
Also, how long is this expected to take to train (assume CPU only, 20-layer or 32-layer ResNet)? With the above definition of an epoch, it seems it would take a very long time...
I was expecting something around 2-3 hours only, which is equivalent to about 10 passes through the 50000 image training set.
The paper never mentions 60000 epochs. An epoch is generally taken to mean one pass over the full dataset. 60000 epochs would be insane. They use 64000 iterations on CIFAR-10. An iteration involves processing one minibatch, computing and then applying gradients.
You are correct in that this means >150 passes over the dataset (these are the epochs). Modern neural network models often take days or weeks to train. ResNets in particular are troublesome due to their massive size/depth. Note that in the paper they mention training the model on two GPUs which will be much faster than on the CPU.
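The conversion between iterations and epochs is simple arithmetic. A short sketch using the numbers from this thread (64000 iterations, minibatches of 128, 50000 training images):

```python
import math

iterations = 64_000
batch_size = 128
train_images = 50_000

# One epoch is one pass over the full training set.
epochs = iterations * batch_size / train_images
print(epochs)  # 163.84

# Conversely, the 10 passes mentioned in the question correspond to:
iterations_for_10_epochs = math.ceil(10 * train_images / batch_size)
print(iterations_for_10_epochs)
```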
If you are just training some models "for fun" I would recommend scaling them down significantly. Try 8 layers or so; even this might be too much. If you are doing this for research/production use, get some GPUs.
I'm using a TensorFlow RNN to predict a set of sequences. I use GRUCell and dynamic_rnn. For training, I split the training dataset into 8 batches, each with a batch size of 10 (one batch has shape [10, 6, 2], which is [batchsize, seqlen, dim]). To prevent overfitting, I stop training when the prediction accuracy on the training dataset starts to exceed 80% (it usually stops at an accuracy of 80%~83%).
After training, I use the same graph just to predict (not train) on the same training dataset. This time, since tf.nn.dynamic_rnn makes it possible to feed batches of variable size, I split the dataset into 80 batches, each with a batch size of 1 and shape [1, 10, 2] (I simply lowered the batch size and therefore increased the number of batches). The accuracy then usually exceeds 90%, which is appreciably higher than 80%. For some reason, shrinking the batch size leads to a higher accuracy. Why does this happen?
As I understand it, you have a small amount of training data and you also stop training early. Don't stop at a fixed accuracy to handle overfitting; instead, check the difference between your training and validation loss. If the difference keeps increasing, your model has started overfitting. Before training, also check whether your dataset is biased or correctly balanced. I think this is happening because you have very little training data or your dataset is not balanced.
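The check described above can be sketched as a small helper that watches the gap between validation and training loss. This is only an illustration under assumptions: the function name, the `patience` and `min_delta` parameters, and the loss values are all made up, not taken from the question.

```python
def first_overfit_epoch(train_losses, val_losses, patience=2, min_delta=1e-3):
    """Return the first epoch index at which the gap (val - train) has grown
    for `patience` consecutive epochs, or None if that never happens."""
    growing = 0
    prev_gap = None
    for epoch, (tr, va) in enumerate(zip(train_losses, val_losses)):
        gap = va - tr
        if prev_gap is not None and gap > prev_gap + min_delta:
            growing += 1
            if growing >= patience:
                return epoch
        else:
            growing = 0
        prev_gap = gap
    return None

# Hypothetical per-epoch losses: the gap is flat at first, then widens.
train = [1.00, 0.80, 0.60, 0.45, 0.30, 0.20]
val = [1.05, 0.85, 0.65, 0.55, 0.50, 0.52]
stop_epoch = first_overfit_epoch(train, val)
print(stop_epoch)
```

This requires a held-out validation set; a threshold on training accuracy alone cannot reveal the gap.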
I have trained a face recognition model with TensorFlow (4301 classes). The training process goes as follows (I have captured the charts of the training process):
training accuracy
training loss
The training accuracy steadily increases. The training loss, however, first decreases and then, after a certain number of iterations, unexpectedly increases.
I simply use softmax loss with a weight regularizer, and I use AdamOptimizer to minimize the loss. The initial learning rate is 0.0001, and it is halved every 7 epochs (380,000 training images in total, batch size 16). I also tested on a validation set (8,300 face images) and got a validation accuracy of about 55.0%, which is far below the training accuracy.
Is it overfitting? Can overfitting lead to a final increase in the training loss?
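To make the learning-rate schedule described in the question concrete, here is a sketch in plain Python. It is not the asker's TensorFlow code; the function name is made up, and only the numbers (initial rate 0.0001, halving every 7 epochs, 380000 images, batch size 16) come from the post.

```python
steps_per_epoch = 380_000 // 16  # 23750 steps per pass over the data

def learning_rate(step, initial_lr=1e-4, epochs_per_halving=7):
    """Step-wise schedule: halve the rate every `epochs_per_halving` epochs."""
    halvings = step // (steps_per_epoch * epochs_per_halving)
    return initial_lr * 0.5 ** halvings

print(steps_per_epoch * 7)      # steps between halvings
print(learning_rate(0))         # initial rate
print(learning_rate(166_250))   # after the first halving
```

Plotting the loss against these halving points can show whether the increase coincides with a schedule change.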
Overfitting is when you start having a divergence in the performance on training and test data — this is not the case here since you are reporting training performance only.
Training is running a minimization algorithm on your loss. When your loss starts increasing, it means that training fails at what it is supposed to be doing. You probably want to change your minimization settings to get your training loss to eventually converge.
As for why your accuracy continues to increase long after your loss starts diverging, it is hard to tell without knowing more. One explanation could be that your loss is a sum of different terms, for example a cross-entropy term and a regularization term, and that only the latter diverges.
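A toy illustration of that last point, with made-up numbers: if the total loss is cross-entropy plus an L2 penalty on the weights, the total can rise even while the cross-entropy falls. The function name, weights, and coefficient below are hypothetical.

```python
def loss_terms(cross_entropy, weights, weight_decay):
    """Return (total, cross_entropy, regularization) for an L2-penalized loss."""
    regularization = weight_decay * sum(w * w for w in weights)
    return cross_entropy + regularization, cross_entropy, regularization

# Early in training: high cross-entropy, small weights.
early_total, early_ce, early_reg = loss_terms(2.0, [0.1] * 100, weight_decay=0.01)
# Later: lower cross-entropy, but much larger weights.
late_total, late_ce, late_reg = loss_terms(0.3, [5.0] * 100, weight_decay=0.01)

print(early_total, late_total)
```

Logging each term as its own summary makes it easy to see which one is responsible for the increase.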
My team is training a CNN in Tensorflow for binary classification of damaged/acceptable parts. We created our code by modifying the cifar10 example code. In my prior experience with Neural Networks, I always trained until the loss was very close to 0 (well below 1). However, we are now evaluating our model with a validation set during training (on a separate GPU), and it seems like the precision stopped increasing after about 6.7k steps, while the loss is still dropping steadily after over 40k steps. Is this due to overfitting? Should we expect to see another spike in accuracy once the loss is very close to zero? The current max accuracy is not acceptable. Should we kill it and keep tuning? What do you recommend? Here is our modified code and graphs of the training process.
https://gist.github.com/justineyster/6226535a8ee3f567e759c2ff2ae3776b
Precision and Loss Images
A decrease in binary cross-entropy loss does not imply an increase in accuracy. Consider a sample with label 1, predictions of 0.2, 0.4 and 0.6 at timesteps 1, 2 and 3, and a classification threshold of 0.5. Going from timestep 1 to timestep 2 decreases the loss but does not increase accuracy, because the prediction is still below the threshold.
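The example can be worked through numerically. A short sketch in plain Python using the values given above (label 1, threshold 0.5):

```python
import math

label = 1
threshold = 0.5
predictions = [0.2, 0.4, 0.6]  # timesteps 1, 2, 3

# Binary cross-entropy for label 1 reduces to -log(p).
losses = [-math.log(p) for p in predictions]
# 1 if the thresholded prediction matches the label, else 0.
correct = [int((p > threshold) == bool(label)) for p in predictions]

print([round(l, 3) for l in losses])  # [1.609, 0.916, 0.511]
print(correct)                        # [0, 0, 1]
```

The loss drops at every timestep, but the sample only becomes correctly classified at timestep 3.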
Ensure that your model has enough capacity by overfitting the training data. If the model is overfitting the training data, avoid overfitting by using regularization techniques such as dropout, L1 and L2 regularization and data augmentation.
Last, confirm your validation data and training data come from the same distribution.
Here are my suggestions. One possible problem is that your network has started to memorize the data, so yes, you should increase regularization.
update:
Here I want to mention one more problem that may cause this:
The class balance in the validation set is far from what you have in the training set. As a first step, I would recommend trying to understand what your test data (the real-world data your model will face at inference time) looks like descriptively: its class balance and other similar characteristics. Then try to build train and validation sets with nearly the same characteristics as the real data.
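Comparing class ratios between splits takes only a few lines. A minimal sketch with hypothetical label lists (the real labels would come from your own dataset):

```python
from collections import Counter

def class_ratios(labels):
    """Return the fraction of examples belonging to each class."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: counts[cls] / total for cls in sorted(counts)}

# Made-up labels: 0 = acceptable, 1 = damaged.
train_labels = [0] * 900 + [1] * 100
val_labels = [0] * 50 + [1] * 50

train_ratios = class_ratios(train_labels)
val_ratios = class_ratios(val_labels)
print(train_ratios)
print(val_ratios)
```

A large difference between the two dictionaries is a sign that the splits do not share the same distribution.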
I faced a similar situation when I used a softmax function in the last layer instead of a sigmoid for binary classification.
My validation loss and training loss were decreasing, but the accuracy of both remained constant. That taught me why sigmoid is used for binary classification.
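One way this can happen, assuming the last layer has a single output unit (the answer does not say so), is that softmax over a single logit always returns 1.0, so the thresholded prediction never changes. A sketch in plain Python:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

for z in (-3.0, 0.0, 3.0):
    # softmax of a single logit is constant; sigmoid depends on the logit
    print(softmax([z]), round(sigmoid(z), 3))
```

With two output units softmax works fine for binary classification; the constant output only arises in the single-unit case.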