Why does Tensorflow LearningRateScheduler *increase* (not decrease) the learning rate with each epoch? - tensorflow

On the Coursera class "TensorFlow in Practice -- Sequences, Time Series and Prediction", the 9th video of the second week uses a callback to dynamically increase (not decrease) the learning rate. I understand why we need to dynamically adjust the rate, but this callback increases the learning rate with each epoch. Don't we want to do the opposite and gradually decrease the learning rate as the neural net learns more? I'm sure the video is correct (it was created by Andrew Ng and Google, who obviously know a lot about TensorFlow), but why are we increasing (instead of decreasing) the learning rate? Is Keras actually using the inverse of this number as the learning rate, or something like that?
#Doesn't the next line *increase* the learning rate with each callback?
#But shouldn't we be gradually decreasing it?
lr_schedule = tf.keras.callbacks.LearningRateScheduler(
    lambda epoch: 1e-8 * 10**(epoch / 20))
optimizer = tf.keras.optimizers.SGD(lr=1e-8, momentum=0.9)
model.compile(loss="mse", optimizer=optimizer)
history = model.fit(dataset, epochs=100, callbacks=[lr_schedule], verbose=0)
And here's a full code example from the sample notebook that they provide with this example:
https://colab.research.google.com/github/lmoroney/dlaicourse/blob/master/TensorFlow%20In%20Practice/Course%204%20-%20S%2BP/S%2BP%20Week%202%20Lesson%203.ipynb
Is it correct to increase the learning rate with each epoch? Won't that result in the optimizer "over-shooting" the answer on each epoch and never converging to a solution?

You are right. It does not make any sense to actually do this when the goal is to train a network. Might they be doing this to demonstrate that your learning rate can be too high? The graph just after it might be showing such a lesson.
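For what it is worth, the schedule in the question can be evaluated on its own, without TensorFlow. The sketch below only reproduces the lambda from the question and prints a few values; the helper name lr_for_epoch is made up for illustration. Over 100 epochs the rate climbs from 1e-8 to just under 1e-3, a factor of ten every 20 epochs. A sweep like this is sometimes used to see at which learning rate the loss stops improving or blows up, which may be what the graph in the lesson is for; that reading is an assumption.

```python
def lr_for_epoch(epoch):
    # Same formula as the lambda passed to LearningRateScheduler in the question.
    return 1e-8 * 10 ** (epoch / 20)

# Learning rate for each of the 100 epochs used in model.fit(...).
rates = [lr_for_epoch(e) for e in range(100)]

for epoch in (0, 20, 40, 60, 80, 99):
    print(f"epoch {epoch:3d}: lr = {rates[epoch]:.2e}")
```

So the callback really does increase the rate; Keras is not inverting the number.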

Related

Accuracy Dropped suddenly after certain epoch (Classification using EfficientNet)

So I have been training EfficientNet for a classification task. I used the EfficientNet-B2 model with a batch size of 64 and a learning rate of 0.0001.
I was able to get good accuracy and loss while gradually increasing the batch size and decreasing the learning rate. But when I just used a learning rate of 0.0001 and let the model run, I found the accuracy dropping significantly after the 26th epoch, while the loss curve followed its usual shape.
I have found a good model but just wanted to know what might be the reason for the accuracy behaving like that in the graph.

Keras model has very good loss after 1 epoch but doesn't really get better with more epochs

Hello I just wanted to ask this theoretical question.
What could cause a model to have a very good loss (0.004 on normalized data) after a single epoch, when that loss then barely decreases over time (after 10 epochs it's still 0.0032)?
Shouldn't it normally decrease much more over time?
The dataset is pretty big, with a bit more than a million data points, and I didn't expect such a good loss after just 1 epoch.
So what could I change about this model or what am I doing wrong? (it's a densely connected NN predicting regression with adam and mse)
There are multiple possibilities, but the problem needs some clarification.
Could you specify the range of your target?
0.004 might sound low as a loss, but it's not if your target ranges from 0 to 0.0001 for example.
What are the metrics on your validation and test data sets? The training loss by itself does not say much without the validation loss.
If 0.004 is too good to be true, your model might be overfitting.
Try implementing dropout to reduce overfitting.
If your model is not overfitting, Adam might be overshooting a (local) minimum. Try lowering its learning rate, or try SGD with custom hyperparameters. This does take a lot of tuning.
There is a free course on Coursera called Machine Learning by Stanford. This covers theory on these concepts (and more) in a good way.
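To make the point about the target range concrete, one option is to compare the model's MSE with the MSE of a trivial baseline that always predicts the mean of the targets. The sketch below uses made-up numbers (uniform random targets in two hypothetical ranges, and the 0.004 from the question as a fixed value); nothing here comes from the asker's data.

```python
import random

def baseline_mse(targets):
    # MSE of always predicting the mean, i.e. the variance of the targets.
    mean = sum(targets) / len(targets)
    return sum((t - mean) ** 2 for t in targets) / len(targets)

random.seed(0)
model_mse = 0.004  # the loss value reported in the question

ratios = {}
for high in (1.0, 0.0001):  # two hypothetical target ranges
    targets = [random.uniform(0.0, high) for _ in range(10_000)]
    base = baseline_mse(targets)
    ratios[high] = model_mse / base
    print(f"targets in [0, {high}]: baseline MSE = {base:.3g}, "
          f"model/baseline = {ratios[high]:.3g}")
```

A ratio well below 1 means the model beats the trivial baseline; a ratio far above 1 means a loss of 0.004 is actually poor for that target scale.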

How to interpret zigzag training loss?

My training data consists of about 700 unique samples (this is for a regression problem). The data is not shuffled, so the first N samples have the same label (say, the value 1.25), then the next M samples have the same label (say, 2.99), etc. In total there are around 15 unique labels.
I'm using a simple CNN, as the input is an image (64x64x3). Even with no dropout or any other form of regularization, I can't get the training loss to stabilize close to zero.
What is this pattern of the learning loss an indication of? (gray line is the training loss, orange line is the validation loss).
The only indication you can get from such a pattern is that the learning rate is too large. You should decrease it until the loss starts to decrease.
It seems that your learning rate is too large, making your parameters oscillate wildly.
Things that I recommend at that point would be to:
Decrease your initial learning rate
Try another optimizer with some sort of learning rate decay (e.g. Adam, which worked well for me in such cases)
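A toy illustration of the "too large" case, not taken from the question: plain gradient descent on f(w) = w**2, whose gradient is 2*w. Each step multiplies w by (1 - 2*lr), so a rate above 0.5 makes w jump across the minimum on every step, and a rate above 1.0 makes it diverge. The function name descend and the three rates are arbitrary choices.

```python
def descend(lr, w=1.0, steps=6):
    # Gradient descent on f(w) = w**2; the gradient is 2*w.
    path = [w]
    for _ in range(steps):
        w = w - lr * 2 * w
        path.append(w)
    return path

# Small rate: smooth approach. Large rate: sign flips. Too large: divergence.
for lr in (0.1, 0.9, 1.1):
    print(lr, [round(w, 3) for w in descend(lr)])
```

With mini-batches and a real network the picture is noisier, but parameters bouncing from one side of a minimum to the other follow the same mechanism.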

[MXNet]Periodic Loss Value when training with "step" learning rate policy

When training a deep CNN, a common approach is to use SGD with momentum and a "step" learning rate policy (e.g. the learning rate is set to 0.1, 0.01, 0.001, ... at different stages of training). But I encountered an unexpected phenomenon when training with this strategy under MXNet.
That is the periodic training loss value
https://user-images.githubusercontent.com/26757001/31327825-356401b6-ad04-11e7-9aeb-3f690bc50df2.png
The above is the training loss at a fixed learning rate 0.01, where the loss is decreasing normally
https://user-images.githubusercontent.com/26757001/31327872-8093c3c4-ad04-11e7-8fbd-327b3916b278.png
However, at the second stage of training (with lr 0.001) , the loss goes up and down periodically, and the period is exactly an epoch
So I thought it might be the problem of data shuffling, but it cannot explain why it doesn't happen in the first stage. Actually I used ImageRecordIter as the DataIter and reset it after every epoch, is there anything I missed or set mistakenly?
train_iter = mx.io.ImageRecordIter(
    path_imgrec=recPath,
    data_shape=dataShape,
    batch_size=batchSize,
    last_batch_handle='discard',
    shuffle=True,
    rand_crop=True,
    rand_mirror=True)
The codes for training and loss evaluation:
while True:
    train_iter.reset()
    for i,databatch in enumerate(train_iter):
        globalIter += 1
        mod.forward(databatch,is_train=True)
        mod.update_metric(metric,databatch.label)
        if globalIter % 100 == 0:
            loss = metric.get()[1]
            metric.reset()
        mod.backward()
        mod.update()
Actually the loss can converge, but it takes too long.
I've suffered from this problem for a long period of time, on different network and different datasets.
I didn't have this problem when using Caffe. Is this due to the implementation difference?
Your loss/learning curves look suspiciously smooth, and I believe you can observe the same oscillation in the loss even when the learning rate is set to 0.01 just at a smaller relative scale (i.e. if you 'zoomed in' to the chart you'd see the same pattern). You may have an issue with your data iterator passing the same batch for example. And your training loop looks wrong but this could be due to formatting (e.g. mod.update() only performed every 100 batches isn't correct).
You can observe periodicity in your loss when you're traveling across a valley in the loss surface, up and down the sides rather than down the valley. Choosing a lower learning rate can help fix this, and make sure you are using momentum too.

the learning rate change for the momentum optimizer

When running an existing TensorFlow implementation, I found that the learning rate stays the same across epochs. The original implementation uses tf.train.MomentumOptimizer and has a decay rate set up.
My understanding of the momentum optimizer is that the learning rate should decrease as the epochs go by. Why does the learning rate stay the same during my training process? Is it possible that the learning rate also depends on the performance, e.g. if the performance does not change quickly, the learning rate stays the same? I think I am not very clear about the underlying mechanism of the momentum optimizer, and I am confused that the learning rate stays the same from epoch to epoch even though I would expect it to keep decreasing based on the given decay rate.
The optimizer is defined as follows
learning_rate = 0.2
decay_rate = 0.95
self.learning_rate_node = tf.train.exponential_decay(learning_rate=learning_rate,
    global_step=global_step,
    decay_steps=training_iters,
    decay_rate=decay_rate,
    staircase=True)
optimizer = tf.train.MomentumOptimizer(learning_rate=self.learning_rate_node).minimize(self.net.cost,
    global_step=global_step)
It is a little hard to tell without seeing more of the code whether my answer will help you.
Still, here are some insights on how the momentum optimizer works and how the learning rate should decay.
First, the update of the vanilla GradientDescentOptimizer, which is the most basic:
W^(n+1)=W^(n)-alpha*(gradient of cost wrt W)(W^n)
You are just following the opposite of the gradient.
The GradientDescentMinimizer with learning rate decay:
W^(n+1)=W^(n)-alpha(n)*(gradient of the cost wrt W)(W^n)
The only thing that changed is the learning rate alpha, which now depends on the step. In TensorFlow the most commonly used schedule is exponential decay, where after every N steps the learning rate is divided by some constant, e.g. 10.
This change often happens later in the training so you might need to let a few epochs pass by before seeing it decay.
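As a sketch of why the rate can look constant for a long time: with staircase=True, tf.train.exponential_decay computes learning_rate * decay_rate ** (global_step // decay_steps), so nothing changes until global_step reaches decay_steps. The pure-Python function below mirrors that formula with the learning_rate and decay_rate from the question; training_iters = 1000 is an assumed placeholder, since the question does not give its value, and the name staircase_decay is made up.

```python
def staircase_decay(learning_rate, global_step, decay_steps, decay_rate):
    # Mirrors the staircase=True formula; integer division keeps the rate
    # flat between multiples of decay_steps.
    return learning_rate * decay_rate ** (global_step // decay_steps)

training_iters = 1000  # assumed value, not given in the question

for step in (0, 500, 999, 1000, 2000):
    print(step, staircase_decay(0.2, step, training_iters, 0.95))
```

With a decay_rate of 0.95 each drop is only 5%, so even after a decay boundary the change is easy to miss in a log.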
The MomentumOptimizer:
you have to keep an additional variable: the update you have done just before i.e you have to store at each time step:
update^(n)=(W^(n)-W^(n-1))
Then the corrected update by momentum is:
update^(n+1)=mu*update^(n)-alpha*(gradient of cost wrt W)(W^n)
So what you are doing is doing simple gradient descent but correcting it by remembering the immediate past (There are smarter and more complicated ways of doing it like Nesterov's momentum)
MomentumOptimizer with learning rate decay:
update^(n)=(W^(n)-W^(n-1))
update^(n+1)=mu*update^(n)-alpha(n)*(gradient of cost wrt W)(W^n)
alpha now depends on n too.
So at some point it will start slowing down, as in gradient descent with learning rate decay, but the decrease will be affected by the momentum.
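The momentum update with a decaying learning rate can be sketched in a few lines of plain Python. This is only an illustration of the equations above on the toy cost f(W) = W**2, not TensorFlow's implementation; mu, alpha0 and the halve-every-50-steps schedule are made-up values.

```python
def grad(w):
    # Gradient of the toy cost f(W) = W**2.
    return 2.0 * w

def alpha(n, alpha0=0.1, decay=0.5, every=50):
    # Hypothetical staircase schedule: halve the rate every `every` steps.
    return alpha0 * decay ** (n // every)

mu = 0.9               # momentum coefficient (assumed value)
w, update = 1.0, 0.0   # there is no previous update before the first step
history = [w]
for n in range(200):
    update = mu * update - alpha(n) * grad(w)  # update^(n+1)
    w = w + update                             # W^(n+1) = W^(n) + update^(n+1)
    history.append(w)

print(history[:4], history[-1])
```

The first step is pure gradient descent because the stored update is zero; from the second step on, part of the previous move is carried over.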
For a complete review of those methods and more you have the excellent website which explains far better than me and Alec Radford's famous visualization which is better than a thousand words.
The learning rate should not depend on the performance unless that is specified in the decay!
It would help to see the code in question!
EDIT1: Here is a working example that I think answers both questions you asked:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
#Pure SGD
BATCH_SIZE=1
#Batch Gradient Descent
#BATCH_SIZE=1000
starter_learning_rate=0.001
xdata=np.linspace(0.,2*np.pi,1000)[:,np.newaxis]
ydata=np.sin(xdata)+np.random.normal(0.0,0.05,size=1000)[:,np.newaxis]
plt.scatter(xdata,ydata)
x=tf.placeholder(tf.float32,[None,1])
y=tf.placeholder(tf.float32, [None,1])
#We define global_step as a variable initialized at 0
global_step=tf.Variable(0,trainable=False)
w1=tf.Variable(0.05*tf.random_normal((1,100)),tf.float32)
w2=tf.Variable(0.05*tf.random_normal((100,1)),tf.float32)
b1=tf.Variable(np.zeros([100]).astype("float32"),tf.float32)
b2=tf.Variable(np.zeros([1]).astype("float32"),tf.float32)
h1=tf.nn.relu(tf.matmul(x,w1)+b1)
y_model=tf.matmul(h1,w2)+b2
L=tf.reduce_mean(tf.square(y_model-y))
#We want to decrease the learning rate after having seen all the data 5 times
NUM_EPOCHS_PER_DECAY=5
LEARNING_RATE_DECAY_FACTOR=0.1
#The decay mechanism counts iterations, not epochs, so we have to convert the number of epochs into a number of iterations
#With batch_size=1 we need exactly 1000 iterations for one epoch, so 5*1000=5000 iterations before decaying; with batch_size=1000, 1 iteration = 1 epoch, so we would decay after 5 iterations
num_batches_per_epoch=int(xdata.shape[0]/float(BATCH_SIZE))
decay_steps=int(num_batches_per_epoch*NUM_EPOCHS_PER_DECAY)
decayed_learning_rate=tf.train.exponential_decay(starter_learning_rate,
    global_step,
    decay_steps,
    LEARNING_RATE_DECAY_FACTOR,
    staircase=True)
#So now we have a tensor that depends on global_step and that will be divided by 10 every decay_steps iterations, i.e. when global_step=N*decay_steps with N a nonzero integer
#We now create a train_step and pass it the decayed learning rate; each time train_step is run, global_step is incremented by 1 (we check this below). BE CAREFUL: WE HAVE TO PASS GLOBAL_STEP AS AN ARGUMENT
train_step=tf.train.GradientDescentOptimizer(decayed_learning_rate).minimize(L,global_step=global_step)
sess=tf.Session()
#tf.initialize_all_variables() is deprecated in favor of tf.global_variables_initializer()
sess.run(tf.global_variables_initializer())
GLOBAL_s=[]
lr_val=[]
COSTS=[]
for i in range(16000):
    #We do 16000 iterations; with a decay every 5000 iterations we will see 3 decays (at 5000, 10000, 15000)
    start_data=(i*BATCH_SIZE)%1000
    COSTS.append([sess.run(L, feed_dict={x:xdata,y:ydata})])
    GLOBAL_s.append([sess.run(global_step)])
    lr_val.append([sess.run(decayed_learning_rate)])
    #I see the train_step as implicitly executing sess.run(tf.add(global_step,1))
    sess.run(train_step,feed_dict={x:xdata[start_data:start_data+BATCH_SIZE],y:ydata[start_data:start_data+BATCH_SIZE]})
plt.figure()
plt.subplot(211)
plt.plot(GLOBAL_s,lr_val,"-b")
plt.title("Evolution of learning rate" )
plt.subplot(212)
plt.plot(GLOBAL_s,COSTS,".g")
plt.title("Evolution of cost" )
#Notice two things: global_step is actually being incremented, and the learning rate is actually being decayed
(You can write MomentumOptimizer() instead of GradientDescentOptimizer(), obviously...)
Here are the two plots I get:
To sum it up: in my mind, when you call train_step, TensorFlow runs tf.add(global_step,1).