I am running image segmentation code in PyTorch, based on the LinkNet architecture.
The optimizer is initially set as:
self.optimizer = torch.optim.Adam(params=self.net.parameters(), lr=lr)
Then I change it to SGD with Nesterov momentum, hoping to improve performance:
self.optimizer = torch.optim.SGD(params=self.net.parameters(), lr=lr, momentum=0.9, nesterov=True)
However, the performance is worse with Nesterov momentum. With Adam the loss converges to 0.19, but with Nesterov it only converges to 0.34.
By the way, the learning rate is divided by 5 if the loss does not decrease for 3 consecutive epochs, and the learning rate can be adjusted at most 3 times; after that, training stops.
I am wondering why this happens and what I should do to improve the optimization. Thanks a lot for the replies :)
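For context, the schedule described above roughly corresponds to PyTorch's ReduceLROnPlateau; a minimal sketch (num_epochs and train_one_epoch are hypothetical placeholders, and the "stop after 3 reductions" logic is not shown):

import torch

self.optimizer = torch.optim.SGD(params=self.net.parameters(), lr=lr, momentum=0.9, nesterov=True)
# factor=0.2 divides the lr by 5; patience=3 waits 3 epochs without a loss decrease
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(self.optimizer, mode='min', factor=0.2, patience=3)

for epoch in range(num_epochs):
    loss = train_one_epoch(self.net, self.optimizer)  # hypothetical: one full training epoch, returns the epoch loss
    scheduler.step(loss)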
It seems like your question relies on the assumption that SGD with Nesterov momentum would definitely perform better than Adam. However, there is no learning algorithm that is better than another no matter what. You always have to check, given your model (layers, activation functions, loss, etc.) and dataset.
Are you increasing the number of epochs for SGD? Usually, SGD takes much longer to converge than Adam. Note that recent studies show that despite training faster, Adam generalizes worse to the validation and test datasets (https://arxiv.org/abs/1712.07628). An alternative to that is to start the optimization with Adam, and then after some epochs, change the optimizer to SGD.
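A minimal sketch of that Adam-then-SGD switch (switch_epoch, num_epochs and train_one_epoch are hypothetical placeholders; the learning rates are just examples):

import torch

optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

for epoch in range(num_epochs):
    if epoch == switch_epoch:
        # re-create the optimizer over the same parameters; SGD usually needs a larger lr than Adam
        optimizer = torch.optim.SGD(net.parameters(), lr=1e-2, momentum=0.9, nesterov=True)
    train_one_epoch(net, optimizer)  # hypothetical: one full pass over the training data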
Hello, I just wanted to ask a theoretical question.
What could cause a model to reach a very good loss (0.004 on normalized data) after a single epoch, while the loss doesn't really decrease after that (after 10 epochs it's still 0.0032)?
Shouldn't it normally decrease much more over time?
The dataset is pretty big, with a bit more than a million data points, and I didn't expect such a good loss after just 1 epoch.
So what could I change about this model, or what am I doing wrong? (It's a densely connected NN doing regression with Adam and MSE.)
There are multiple possibilities, but the problem needs some clarification.
Could you specify the range of your target?
0.004 might sound low as a loss, but it's not if your target ranges from 0 to 0.0001, for example.
What are the metrics on your validation and test sets? The training loss by itself does not say much without knowing the validation loss.
Guessing that the 0.004 is too good to be true, your model might be overfitting.
Try implementing dropout to avoid overfitting.
In case your model is not overfitting, it might be that Adam is overshooting a (local) minimum. Try lowering its learning rate, or try SGD with custom hyperparameters. This does take a lot of tuning.
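A minimal sketch of adding dropout to a dense regression network, assuming a Keras-style model (the layer sizes, dropout rate and n_features are placeholders):

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(128, activation='relu', input_shape=(n_features,)),  # n_features is a placeholder
    layers.Dropout(0.3),   # randomly zeroes 30% of activations during training
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.3),
    layers.Dense(1),       # linear output for regression
])
# a lower Adam learning rate, as suggested above (the Keras default is 1e-3)
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4), loss='mse')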
There is a free course on Coursera called Machine Learning by Stanford. This covers theory on these concepts (and more) in a good way.
Specifically, within one step, how does it train the model? What is the stopping condition for the gradient descent and backpropagation?
Docs here: https://www.tensorflow.org/api_docs/python/tf/estimator/Estimator#train
e.g.
mnist_classifier = tf.estimator.Estimator(model_fn=cnn_model_fn)
train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"x": X_train},
    y=y_train,
    batch_size=50,
    num_epochs=None,
    shuffle=True)
mnist_classifier.train(
    input_fn=train_input_fn,
    steps=100,
    hooks=[logging_hook])
I understand that training one step means that we feed the neural network model with batch_size data points once. My question is: within this one step, how many times does it perform gradient descent? Does it do backpropagation and gradient descent just once, or does it keep performing gradient descent until the model weights reach an optimum for this batch of data?
In addition to @David Parks' answer, using batches for performing gradient descent is referred to as stochastic (mini-batch) gradient descent. Instead of updating the weights after each training sample, you average the gradients over the batch and use this averaged gradient to update your weights.
For example, if you have 1000 training samples and use batches of 200, you calculate the average gradient over 200 samples and update your weights with it. That means you only perform 5 updates per epoch instead of updating your weights 1000 times. On sufficiently big datasets, you will experience a much faster training process.
Michael Nielsen has a really nice way to explain this concept in his book.
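A tiny numpy sketch of one such averaged update, assuming a linear model with a squared-error loss (all numbers are placeholders):

import numpy as np

X = np.random.randn(200, 3)    # one batch of 200 samples with 3 features
y = np.random.randn(200)       # targets for the batch
w = np.zeros(3)                # model weights
lr = 0.01

# per-sample gradient of 0.5*(x.w - y)^2 wrt w is (x.w - y)*x; average it over the batch
errors = X @ w - y                            # shape (200,)
grad = (X * errors[:, None]).mean(axis=0)     # averaged gradient, shape (3,)
w = w - lr * grad                             # one weight update for the whole batch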
1 step = 1 gradient update. And each gradient update step requires one forward pass and one backward pass.
The stopping condition is generally left up to you and is arguably more art than science. Commonly you will plot (TensorBoard is handy here) your cost, training accuracy, and periodically your validation-set accuracy. The point where validation accuracy peaks (equivalently, where validation loss bottoms out) is generally a good point to stop. Depending on your dataset, validation accuracy may drop and at some point increase again, or it may simply flatten out, at which point the stopping condition often correlates with the developer's degree of impatience.
Here's a nice article on stopping conditions, a google search will turn up plenty more.
https://stats.stackexchange.com/questions/231061/how-to-use-early-stopping-properly-for-training-deep-neural-network
Another common approach to stopping is to drop the learning rate every time you see that validation accuracy has not improved for some "reasonable" number of steps. When you've effectively hit a learning rate of 0, you call it quits.
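A rough sketch of that stopping logic in plain Python (train_for and evaluate_validation_accuracy are hypothetical helpers; the thresholds are placeholders):

lr = 0.1
best_val_acc = 0.0
steps_since_improvement = 0
patience = 2000   # "reasonable" number of steps without improvement

while lr > 1e-6:                                  # effectively-zero learning rate: call it quits
    train_for(100, lr)                            # hypothetical: run 100 training steps at this lr
    val_acc = evaluate_validation_accuracy()      # hypothetical: evaluate on the validation set
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        steps_since_improvement = 0
    else:
        steps_since_improvement += 100
    if steps_since_improvement >= patience:
        lr *= 0.1                                 # drop the learning rate
        steps_since_improvement = 0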
The input function emits batches (when num_epochs=None, num_batches is infinite):
num_batches = num_epochs * (num_samples / batch_size)
One step processes 1 batch; if steps > num_batches, training will stop after num_batches steps.
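For example (hypothetical numbers): with num_samples = 1000, batch_size = 50 and num_epochs = 2, num_batches = 2 * (1000 / 50) = 40, so train(..., steps=100) would stop after 40 steps; with num_epochs=None, as in the snippet above, the input function never runs out and training runs the full 100 steps.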
My team is training a CNN in Tensorflow for binary classification of damaged/acceptable parts. We created our code by modifying the cifar10 example code. In my prior experience with Neural Networks, I always trained until the loss was very close to 0 (well below 1). However, we are now evaluating our model with a validation set during training (on a separate GPU), and it seems like the precision stopped increasing after about 6.7k steps, while the loss is still dropping steadily after over 40k steps. Is this due to overfitting? Should we expect to see another spike in accuracy once the loss is very close to zero? The current max accuracy is not acceptable. Should we kill it and keep tuning? What do you recommend? Here is our modified code and graphs of the training process.
https://gist.github.com/justineyster/6226535a8ee3f567e759c2ff2ae3776b
Precision and Loss Images
A decrease in binary cross-entropy loss does not imply an increase in accuracy. Consider label 1, predictions 0.2, 0.4 and 0.6 at timesteps 1, 2, 3, and a classification threshold of 0.5. Timesteps 1 and 2 will produce a decrease in loss but no increase in accuracy.
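A quick numpy check of that example (binary cross-entropy for a single sample with label 1):

import numpy as np

label = 1.0
for p in (0.2, 0.4, 0.6):
    bce = -(label * np.log(p) + (1 - label) * np.log(1 - p))
    correct = int((p >= 0.5) == label)
    print(f"p={p}: loss={bce:.3f}, correct={correct}")
# loss drops from 1.609 to 0.916 to 0.511, but the prediction only
# crosses the 0.5 threshold (and counts as correct) at the last step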
Ensure that your model has enough capacity by confirming it can overfit the training data. If the model is overfitting the training data, counter the overfitting with regularization techniques such as dropout, L1 and L2 regularization, and data augmentation.
Lastly, confirm that your validation data and training data come from the same distribution.
Here are my suggestions. One possible problem is that your network has started to memorize the data, so yes, you should increase regularization.
Update:
Here I want to mention one more problem that may cause this:
The class balance in your validation set is very different from that in your training set. I would recommend, as a first step, trying to understand what your test data (the real-world data the model will face at inference time) looks like: what its class balance is, and other similar characteristics. Then try to build a train/validation split with almost the same characteristics you measured for the real data.
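A minimal sketch of building a validation split that preserves the class balance, assuming scikit-learn is available (X, y and the 0.2 split size are placeholders):

from sklearn.model_selection import train_test_split

# stratify=y keeps the damaged/acceptable ratio the same in the train and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)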
Well, I faced a similar situation when I used a softmax function in the last layer instead of a sigmoid for binary classification.
My validation loss and training loss were decreasing, but the accuracy of both remained constant. This taught me why sigmoid is used for binary classification.
I have run deep learning models (CNNs) using TensorFlow. Many times during an epoch, I have observed that both loss and accuracy increased, or both decreased. My understanding was that the two are always inversely related. What could be a scenario where both increase or decrease simultaneously?
The loss decreases as the training process goes on, except for some fluctuation introduced by the mini-batch gradient descent and/or regularization techniques like dropout (that introduces random noise).
If the loss decreases, the training process is going well.
The (validation, I suppose) accuracy, instead, is a measure of how good your model's predictions are.
If the model is learning, the accuracy increases. If the model is overfitting, instead, the accuracy stops increasing and can even start to decrease.
If the loss decreases and the accuracy decreases, your model is overfitting.
If the loss increases and the accuracy increases too, it is because your regularization techniques are working well and you're fighting the overfitting problem. This is true only if the loss then starts to decrease while the accuracy continues to increase.
Otherwise, if the loss keeps growing, your model is diverging and you should look for the cause (usually a learning rate value that is too high).
I think the top-rated answer is incorrect.
I will assume you are talking about cross-entropy loss, which can be thought of as a measure of 'surprise'.
Loss and accuracy increasing/decreasing simultaneously on the training data tells you nothing about whether your model is overfitting. This can only be determined by comparing loss/accuracy on the validation vs. training data.
If loss and accuracy are both decreasing, it means your model is becoming more confident on its correct predictions, or less confident on its incorrect predictions, or both, hence decreased loss. However, it is also making more incorrect predictions overall, hence the drop in accuracy. Vice versa if both are increasing. That is all we can say.
I'd like to add a possible option here for all those who are struggling with model training right now.
If your validation data is a bit dirty, you might find that at the beginning of training the validation loss is low, as is the accuracy, but the more you train the network, the more the accuracy grows alongside the loss. This happens because the network eventually runs into the outliers in your dirty data and gets a very high loss there. Therefore, your accuracy grows as it guesses more data points right, but the loss grows with it.
This is just what I think, based on the math behind the loss and the accuracy.
Note:
I assume your data is categorical (one-hot labels).
Your model's output:
[0.1, 0.9, 0.9009, 0.8] (used to calculate the loss)
Argmaxed output:
[0, 0, 1, 0] (used to calculate the accuracy)
Expected output:
[0, 1, 0, 0]
Let's clarify what loss and accuracy calculate:
Loss: the overall error between y and ypred.
Accuracy: just whether y and argmax(ypred) are equal.
So overall our model got close on most entries (it even assigns 0.9 to the correct class), resulting in a low loss.
But the argmaxed output has no notion of "close": it either matches the expected output completely or it doesn't.
If they completely match: accuracy 1, else: accuracy 0.
Here argmax picks the third entry instead of the second, so the accuracy ends up low even though the loss is low.
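A small numpy sketch of that distinction, using mean squared error as a stand-in for the "overall error" (the exact loss in your model may differ):

import numpy as np

y_pred = np.array([0.1, 0.9, 0.9009, 0.8])   # model output from the example above
y_true = np.array([0.0, 1.0, 0.0, 0.0])      # expected one-hot output

loss = np.mean((y_pred - y_true) ** 2)                  # looks at how close every entry is
acc = float(np.argmax(y_pred) == np.argmax(y_true))     # only the argmax matters

print("loss:", loss)       # a graded measure over all entries
print("accuracy:", acc)    # 0.0, because argmax picks index 2 but index 1 was expected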
Try checking the MAE of the model.
Remove regularization.
Check whether you are using the correct loss.
You should check your class indices (both train and validation) during the training process; they might be sorted in different ways. I had this problem in Colab.
When running an existing TensorFlow implementation, I found that the learning rate stays the same between different epochs. The original implementation uses tf.train.MomentumOptimizer and has a decay rate set up.
My understanding of the momentum optimizer is that the learning rate should decrease over the epochs. Why does the learning rate stay the same during my training process? Is it possible that the learning rate also depends on the performance, e.g., that if the performance does not change quickly, the learning rate stays the same? I think I am not very clear about the underlying mechanism of the momentum optimizer, and I am confused that the learning rate stays the same across epochs even though I expected it to keep decreasing based on the given decay rate.
The optimizer is defined as follows
learning_rate = 0.2
decay_rate = 0.95
self.learning_rate_node = tf.train.exponential_decay(learning_rate=learning_rate,
                                                      global_step=global_step,
                                                      decay_steps=training_iters,
                                                      decay_rate=decay_rate,
                                                      staircase=True)
optimizer = tf.train.MomentumOptimizer(learning_rate=self.learning_rate_node).minimize(self.net.cost,
                                                                                        global_step=global_step)
It is a little bit hard to tell, without looking at the code, whether my answer will be helpful to you or not.
However, here are some insights on how the momentum optimizer works and how the learning rate should decay.
First, the vanilla GradientDescentMinimizer update, which is the most basic:
W^(n+1) = W^(n) - alpha * (gradient of cost wrt W)(W^(n))
You are just following the opposite direction of the gradient.
The GradientDescentMinimizer with learning rate decay:
W^(n+1) = W^(n) - alpha(n) * (gradient of cost wrt W)(W^(n))
The only thing that changed is the learning rate alpha, which now depends on the step. In TensorFlow the most commonly used decay is exponential decay, where after N steps the learning rate is divided by some constant, e.g. 10.
This change often happens later in training, so you might need to let a few epochs pass before seeing the decay.
The MomentumOptimizer:
You have to keep an additional variable: the update you performed just before, i.e. you have to store at each time step:
update^(n) = W^(n) - W^(n-1)
Then the momentum-corrected update is:
update^(n+1) = m * update^(n) - alpha * (gradient of cost wrt W)(W^(n))
So what you are doing is simple gradient descent, corrected by remembering the immediate past. (There are smarter and more complicated ways of doing it, like Nesterov's momentum.)
MomentumOptimizer with learning rate decay:
update^(n) = W^(n) - W^(n-1)
update^(n+1) = m * update^(n) - alpha(n) * (gradient of cost wrt W)(W^(n))
alpha now depends on n too.
So at some point it will start slowing down, as in gradient descent with learning rate decay, but the decrease will be affected by the momentum.
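A tiny numpy sketch of that update rule (grad_cost is a hypothetical function returning the gradient of the cost with respect to W; m, the alphas and the dimensions are placeholders):

import numpy as np

W = np.zeros(3)                 # parameters
update = np.zeros(3)            # previous update, W^(n) - W^(n-1)
m = 0.9                         # momentum coefficient

for n in range(1000):
    alpha = 0.1 * (0.95 ** (n // 100))     # exponentially decayed learning rate
    grad = grad_cost(W)                    # hypothetical gradient of the cost wrt W
    update = m * update - alpha * grad     # momentum-corrected update
    W = W + update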
For a complete review of those methods and more, there is an excellent website which explains it far better than I can, plus Alec Radford's famous visualization, which is better than a thousand words.
The learning rate should not depend on the performance unless that is specified in the decay!
It would help to see the code in question!
EDIT1: Here is a working example that I think answers both questions you asked:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
#Pure SGD
BATCH_SIZE=1
#Batch Gradient Descent
#BATCH_SIZE=1000
starter_learning_rate=0.001
xdata=np.linspace(0.,2*np.pi,1000)[:,np.newaxis]
ydata=np.sin(xdata)+np.random.normal(0.0,0.05,size=1000)[:,np.newaxis]
plt.scatter(xdata,ydata)
x=tf.placeholder(tf.float32,[None,1])
y=tf.placeholder(tf.float32, [None,1])
#We define global_step as a variable initialized at 0
global_step=tf.Variable(0,trainable=False)
w1=tf.Variable(0.05*tf.random_normal((1,100)),tf.float32)
w2=tf.Variable(0.05*tf.random_normal((100,1)),tf.float32)
b1=tf.Variable(np.zeros([100]).astype("float32"),tf.float32)
b2=tf.Variable(np.zeros([1]).astype("float32"),tf.float32)
h1=tf.nn.relu(tf.matmul(x,w1)+b1)
y_model=tf.matmul(h1,w2)+b2
L=tf.reduce_mean(tf.square(y_model-y))
#We want to decrease the learning rate after having seen all the data 5 times
NUM_EPOCHS_PER_DECAY=5
LEARNING_RATE_DECAY_FACTOR=0.1
#Since the mechanism of the decay depends on the number of iterations and not epochs, we have to connect the number of epochs to the number of iterations
#So if we have batch_size=1 we have to iterate exactly 1000 times to do one epoch, so 5*1000=5000 iterations before decaying; if the batch_size was 1000, 1 iteration = 1 epoch and thus we would decay after 5 iterations
num_batches_per_epoch=int(xdata.shape[0]/float(BATCH_SIZE))
decay_steps=int(num_batches_per_epoch*NUM_EPOCHS_PER_DECAY)
decayed_learning_rate=tf.train.exponential_decay(starter_learning_rate,
                                                 global_step,
                                                 decay_steps,
                                                 LEARNING_RATE_DECAY_FACTOR,
                                                 staircase=True)
#So now we have an object that depends on global_step and that will be divided by 10 every decay_steps iterations i.e. when global_step=N*decay_steps with N a non zero integer
#We now create a train_step to which we pass the learning rate created each time this function is called global_step will be incremented by 1 we are gonna check that it is the case BE CAREFUL WE HAVE TO GIVE IT GLOBAL_STEP AS AN ARGUMENT
train_step=tf.train.GradientDescentOptimizer(decayed_learning_rate).minimize(L,global_step=global_step)
sess=tf.Session()
sess.run(tf.initialize_all_variables())
GLOBAL_s=[]
lr_val=[]
COSTS=[]
for i in range(16000):
    #We will do 16000 iterations, so as there is a decay every 5000 iterations we will see 3 decays (at 5000, 10000 and 15000)
    start_data=(i*BATCH_SIZE)%1000
    COSTS.append([sess.run(L, feed_dict={x:xdata,y:ydata})])
    GLOBAL_s.append([sess.run(global_step)])
    lr_val.append([sess.run(decayed_learning_rate)])
    #I see the train_step as implicitly executing sess.run(tf.add(global_step,1))
    sess.run(train_step,feed_dict={x:xdata[start_data:start_data+BATCH_SIZE],y:ydata[start_data:start_data+BATCH_SIZE]})
plt.figure()
plt.subplot(211)
plt.plot(GLOBAL_s,lr_val,"-b")
plt.title("Evolution of learning rate" )
plt.subplot(212)
plt.plot(GLOBAL_s,COSTS,".g")
plt.title("Evolution of cost" )
#notice two things first global_step is actually being incremented and learning rate is actually being decayed
(You can write MomentumOptimizer() instead of GradientDescentOptimizer(), obviously...)
Here are the two plots I get:
To sum it up: in my mind, when you call train_step, TensorFlow runs tf.add(global_step, 1).