Why is momentum as time constant different in the first epoch with parallel training? - cntk

Given the configuration momentumPerMB=0.9, I observe that the value of momentumAsTimeConstant reported for the first epoch is way off; the remaining epochs have the expected value. This seems to happen only in parallel training (1bit and BM; I didn't verify MA yet).
01/11/2017 00:08:08: Starting Epoch 1: learning rate per sample = 0.000500 effective momentum = 0.900000 momentum as time constant = 155504.2 samples
01/11/2017 00:18:04: Starting Epoch 2: learning rate per sample = 0.000500 effective momentum = 0.900000 momentum as time constant = 19438.0 samples
Any ideas why this happens?

We recommend specifying momentumAsTimeConstant instead, because that measure is invariant to the minibatch size.
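Note that the two logged numbers differ by exactly a factor of 8, which would be consistent with the conversion being done with the aggregate minibatch across the parallel workers in epoch 1 and with the per-worker minibatch afterwards. A rough sketch of the relation the log line appears to use (the minibatch size of 2048 and the count of 8 workers are assumptions I am making so the numbers come out, not values taken from your config):
import math
momentum_per_mb = 0.9
mb_size = 2048   # assumed per-worker minibatch size
workers = 8      # assumed number of parallel workers
# momentumAsTimeConstant = -minibatchSize / ln(momentumPerMB)
print(-mb_size / math.log(momentum_per_mb))              # ~19438, matches epoch 2 onwards
print(-(workers * mb_size) / math.log(momentum_per_mb))  # ~155504, matches epoch 1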

Related

How to use OneCycleLR?

I want to train on CIFAR-10, suppose for 200 epochs.
This is my optimizer:
optimizer = optim.Adam([x for x in model.parameters() if x.requires_grad], lr=0.001)
I want to use OneCycleLR as scheduler. Now, according to the documentation, these are the parameters of OneCycleLR:
torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr, total_steps=None, epochs=None, steps_per_epoch=None, pct_start=0.3, anneal_strategy='cos', cycle_momentum=True, base_momentum=0.85, max_momentum=0.95, div_factor=25.0, final_div_factor=10000.0, three_phase=False, last_epoch=- 1, verbose=False)
I have seen that the most used are max_lr, epochs and steps_per_epoch. The documentation says this:
max_lr (float or list) – Upper learning rate boundaries in the cycle for each parameter group.
epochs (int) – The number of epochs to train for. This is used along with steps_per_epoch in order to infer the total number of steps in the cycle if a value for total_steps is not provided. Default: None
steps_per_epoch (int) – The number of steps per epoch to train for. This is used along with epochs in order to infer the total number of steps in the cycle if a value for total_steps is not provided. Default: None
About steps_per_epoch: I have seen in many GitHub repos that steps_per_epoch=len(data_loader) is used, so if I have a batch size of 128, then this parameter is equal to 128.
However, I do not understand the other two parameters. If I want to train for 200 epochs, should epochs=200? Or is this a parameter that runs the scheduler for only that many epochs and then restarts it? For example, if I write epochs=10 inside the scheduler but train for 200 in total, does that amount to 20 complete cycles of the scheduler?
As for max_lr, I have seen people using a value greater than the optimizer's lr and other people using a smaller value. I think max_lr must be greater than the lr (otherwise, why would it be called max? :smiley:)
However, if I print the learning rate epoch by epoch, it assumes strange values. For example, in this setting:
optimizer = optim.Adam([x for x in model.parameters() if x.requires_grad], lr=0.001)
scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr = 0.01, epochs=200, steps_per_epoch=128)
And this is the learning rate:
Epoch 1: TrL=1.7557, TrA=0.3846, VL=1.4136, VA=0.4917, TeL=1.4266, TeA=0.4852, LR=0.0004,
Epoch 2: TrL=1.3414, TrA=0.5123, VL=1.2347, VA=0.5615, TeL=1.2231, TeA=0.5614, LR=0.0004,
...
Epoch 118: TrL=0.0972, TrA=0.9655, VL=0.8445, VA=0.8161, TeL=0.8764, TeA=0.8081, LR=0.0005,
Epoch 119: TrL=0.0939, TrA=0.9677, VL=0.8443, VA=0.8166, TeL=0.9094, TeA=0.8128, LR=0.0005,
So the lr is still increasing.
The documentation says that you should give either total_steps or both epochs & steps_per_epoch as arguments. The simple relation between them is total_steps = epochs * steps_per_epoch.
total_steps is the total number of steps in the cycle; the "OneCycle" in the name means there is only one cycle over the whole training run.
max_lr is the maximum learning rate of OneCycleLR. To be exact, the learning rate will increase from initial_lr = max_lr / div_factor to max_lr over the first pct_start * total_steps steps, and then decrease smoothly to initial_lr / final_div_factor.
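To make that concrete, here is a small plain-Python sketch of the three reference learning rates the schedule moves between, using the documented relations, the default div_factor / final_div_factor, and the question's 200 epochs with an assumed 391 batches per epoch:
max_lr = 0.01
div_factor = 25.0
final_div_factor = 1e4
pct_start = 0.3
total_steps = 200 * 391                      # epochs * steps_per_epoch (assumed value)
initial_lr = max_lr / div_factor             # 4e-4: where the warm-up starts
min_lr = initial_lr / final_div_factor       # 4e-8: where the anneal ends
warmup_steps = int(pct_start * total_steps)  # steps spent rising from initial_lr to max_lr
print(initial_lr, max_lr, min_lr, warmup_steps)
Note that initial_lr = 0.01 / 25 = 0.0004 is exactly the LR printed at epoch 1 in the question.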
Edit: For those who are not familiar with lr_scheduler, you can plot the learning rate curve, e.g.
import torch
import matplotlib.pyplot as plt

EPOCHS = 10
BATCHES = 10
steps = []
lrs = []
model = torch.nn.Linear(1, 1)  # stand-in for your model instance
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)  # wrapped optimizer
scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=0.9, total_steps=EPOCHS * BATCHES)
for epoch in range(EPOCHS):
    for batch in range(BATCHES):
        scheduler.step()  # in a real loop, call this once per batch, after optimizer.step()
        lrs.append(scheduler.get_last_lr()[0])
        steps.append(epoch * BATCHES + batch)
plt.figure()
plt.plot(steps, lrs, label='OneCycle')
plt.legend()
plt.show()
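As for the settings in the question: steps_per_epoch is meant to be the number of optimizer steps per epoch, i.e. len(train_loader), which for CIFAR-10 with batch size 128 is 391, not 128, and scheduler.step() has to be called once per batch rather than once per epoch. The very slow rise from LR=0.0004 in the log suggests the scheduler is only being stepped once per epoch, so it never leaves the warm-up phase. A minimal sketch of the intended wiring (train_loader, model and criterion are assumed names from your setup):
optimizer = optim.Adam([p for p in model.parameters() if p.requires_grad], lr=0.001)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.01, epochs=200, steps_per_epoch=len(train_loader))
for epoch in range(200):
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        scheduler.step()  # once per batch, so epochs * steps_per_epoch steps in total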

Pytorch schedule learning rate

I am trying to re-implement a paper, which suggests adjusting the learning rate as follows:
"The learning rate is decreased by a factor of the regression value with patience epochs 10 on the change value of 0.0001."
Should I use torch.optim.lr_scheduler.ReduceLROnPlateau()?
I am not sure what value I should pass to each parameter.
Does the "change value" in the statement denote the parameter threshold?
Does the "factor" in the statement denote the parameter factor?
torch.optim.lr_scheduler.ReduceLROnPlateau is indeed what you are looking for. I summarized all of the important stuff for you.
mode=min: lr will be reduced when the quantity monitored has stopped decreasing
factor: factor by which the learning rate will be reduced
patience: number of epochs with no improvement after which learning rate will be reduced
threshold: threshold for measuring the new optimum, to only focus on significant changes (the "change value"). Say we have threshold=0.0001: if the loss is 18.0 on epoch n and 17.9999 on epoch n+1, that change is smaller than the threshold and does not count as an improvement; once patience such epochs have passed, the current learning rate is multiplied by the factor.
criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.1, patience=10,
    threshold=0.0001, threshold_mode='abs')
for epoch in range(20):
    # training loop stuff
    loss = criterion(...)
    scheduler.step(loss)
You can check more details in the documentation: https://pytorch.org/docs/stable/optim.html#torch.optim.lr_scheduler.ReduceLROnPlateau
Pytorch has many ways to let you reduce the learning rate. It is quite well explained here:
https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
@Antonino DiMaggio explained ReduceLROnPlateau quite well. I just want to complement that answer by replying to the comment of @Yan-JenHuang:
Is it possible to decrease the learning_rate by subtracting a constant value instead of by a factor?
First of all, you should be very careful to avoid negative values of lr! Second, subtracting a value from the learning rate is not common practice. But in any case...
You first have to make a custom lr scheduler (I modified the code of LambdaLR, https://pytorch.org/docs/stable/_modules/torch/optim/lr_scheduler.html#LambdaLR):
import warnings
from torch.optim.lr_scheduler import _LRScheduler

class SubtractLR(_LRScheduler):
    def __init__(self, optimizer, lr_lambda, last_epoch=-1, min_lr=1e-6):
        self.optimizer = optimizer
        self.min_lr = min_lr  # min learning rate > 0
        if not isinstance(lr_lambda, list) and not isinstance(lr_lambda, tuple):
            self.lr_lambdas = [lr_lambda] * len(optimizer.param_groups)
        else:
            if len(lr_lambda) != len(optimizer.param_groups):
                raise ValueError("Expected {} lr_lambdas, but got {}".format(
                    len(optimizer.param_groups), len(lr_lambda)))
            self.lr_lambdas = list(lr_lambda)
        self.last_epoch = last_epoch
        super(SubtractLR, self).__init__(optimizer, last_epoch)

    def get_lr(self):
        if not self._get_lr_called_within_step:
            warnings.warn("To get the last learning rate computed by the scheduler, "
                          "please use `get_last_lr()`.")
        # subtract the lambda's value from the base lr, clipped at min_lr
        return [max(base_lr - lmbda(self.last_epoch), self.min_lr)
                for lmbda, base_lr in zip(self.lr_lambdas, self.base_lrs)]
Then you can use it in your training:
lambda1 = lambda epoch: 1e-4  # constant to subtract from the lr
scheduler = SubtractLR(optimizer, lr_lambda=[lambda1])
for epoch in range(100):
    train(...)
    validate(...)
    scheduler.step()
lambda1 = lambda epoch: epoch * 1e-6  # the amount subtracted grows proportionally to the epoch
scheduler = SubtractLR(optimizer, lr_lambda=[lambda1])
for epoch in range(100):
    train(...)
    validate(...)
    scheduler.step()
You can also modify the code of ReduceLROnPlateau to subtract from the learning rate instead of multiplying it. You should change the line new_lr = max(old_lr * self.factor, self.min_lrs[i]) to something like new_lr = max(old_lr - self.factor, self.min_lrs[i]); a subclassing sketch of the same idea follows below. You can take a look at the code yourself: https://pytorch.org/docs/stable/_modules/torch/optim/lr_scheduler.html#ReduceLROnPlateau
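If you prefer not to edit the installed source, the same effect can be had by subclassing and overriding the private _reduce_lr hook. This is only a sketch: it relies on a private method that may change between PyTorch versions, and factor is reinterpreted here as the constant to subtract:
import torch

class SubtractOnPlateau(torch.optim.lr_scheduler.ReduceLROnPlateau):
    def _reduce_lr(self, epoch):
        # same structure as the original method, but subtracts instead of multiplying
        for i, param_group in enumerate(self.optimizer.param_groups):
            old_lr = float(param_group['lr'])
            new_lr = max(old_lr - self.factor, self.min_lrs[i])
            param_group['lr'] = new_lr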
As a supplement to the above answer on ReduceLROnPlateau: threshold also has a mode (rel or abs) in PyTorch's lr scheduler (at least for versions >= 1.6), and the default is 'rel', which means that if your loss is 18, it has to change by at least 18*0.0001 = 0.0018 to be recognized as an improvement. So watch out for the threshold mode as well.
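For reference, a small sketch of the comparison ReduceLROnPlateau makes in mode='min' (it mirrors the behaviour described above; simplified, not the library code verbatim):
def is_improvement(current, best, threshold=1e-4, threshold_mode='rel'):
    if threshold_mode == 'rel':
        return current < best * (1.0 - threshold)  # best=18.0 -> must drop below 17.9982
    else:  # 'abs'
        return current < best - threshold          # best=18.0 -> must drop below 17.9999
Only epochs that fail this check count towards patience; after patience such epochs the lr is multiplied by factor.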

Why does Tensorflow LearningRateScheduler *increase* (not decrease) the learning rate with each epoch?

On the Coursera class "TensorFlow in Practice -- Sequences, Time Series and Prediction", the 9th video of the second week uses a callback to dynamically increase (not decrease) the learning rate. I understand why we need to dynamically adjust the rate; but this callback is increasing the learning rate with each epoch. Don't we want to do the opposite and gradually decrease the learning rate as the neural net learns more? I'm sure the video is correct (it was created by Andrew Ng and Google, who obviously know a lot about TensorFlow), but why are we increasing (instead of decreasing) the learning rate? Is Keras actually using the inverse of this number as the learning rate, or something like that?
#Doesn't the next line *increase* the learning rate with each callback?
#But shouldn't we be gradually decreasing it?
lr_schedule = tf.keras.callbacks.LearningRateScheduler(
    lambda epoch: 1e-8 * 10**(epoch / 20))
optimizer = tf.keras.optimizers.SGD(lr=1e-8, momentum=0.9)
model.compile(loss="mse", optimizer=optimizer)
history = model.fit(dataset, epochs=100, callbacks=[lr_schedule], verbose=0)
And here's a full code example from the sample notebook that they provide with this example:
https://colab.research.google.com/github/lmoroney/dlaicourse/blob/master/TensorFlow%20In%20Practice/Course%204%20-%20S%2BP/S%2BP%20Week%202%20Lesson%203.ipynb
Is it correct to increase the learning rate with each epoch? Won't that result in the optimizer "over-shooting" the answer on each epoch and never converging to a solution?
You are right that you would not train a network this way. What this callback is doing is a learning-rate sweep: the lr is deliberately increased every epoch over a short run so that you can afterwards plot the loss against the lr used at each epoch, see where the loss still decreases smoothly and where it blows up because the rate is too high, and then pick a learning rate from the stable region for the real training run. The graph just after it in the notebook is presumably showing exactly that lesson.
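The usual way to read such a sweep (and I believe the linked notebook does essentially this) is to reconstruct the lr used at each epoch and plot loss against it on a log axis; history here is the object returned by model.fit in the snippet above, and the numpy/matplotlib imports are assumed:
import numpy as np
import matplotlib.pyplot as plt

lrs = 1e-8 * (10 ** (np.arange(100) / 20))  # the lr the callback set at each of the 100 epochs
plt.semilogx(lrs, history.history["loss"])
plt.xlabel("learning rate")
plt.ylabel("loss")
plt.show()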

[MXNet]Periodic Loss Value when training with "step" learning rate policy

When training a deep CNN, a common approach is to use SGD with momentum together with a "step" learning rate policy (e.g. the learning rate is set to 0.1, 0.01, 0.001, ... at different stages of training). But I encounter an unexpected phenomenon when training with this strategy under MXNet.
That is, the training loss value is periodic:
https://user-images.githubusercontent.com/26757001/31327825-356401b6-ad04-11e7-9aeb-3f690bc50df2.png
The above is the training loss at a fixed learning rate of 0.01, where the loss decreases normally.
https://user-images.githubusercontent.com/26757001/31327872-8093c3c4-ad04-11e7-8fbd-327b3916b278.png
However, at the second stage of training (with lr 0.001), the loss goes up and down periodically, and the period is exactly one epoch.
So I thought it might be a problem with data shuffling, but that cannot explain why it doesn't happen in the first stage. Actually I use ImageRecordIter as the DataIter and reset it after every epoch. Is there anything I missed or set mistakenly?
train_iter = mx.io.ImageRecordIter(
    path_imgrec=recPath,
    data_shape=dataShape,
    batch_size=batchSize,
    last_batch_handle='discard',
    shuffle=True,
    rand_crop=True,
    rand_mirror=True)
The code for training and loss evaluation:
while True:
    train_iter.reset()
    for i, databatch in enumerate(train_iter):
        globalIter += 1
        mod.forward(databatch, is_train=True)
        mod.update_metric(metric, databatch.label)
        if globalIter % 100 == 0:
            loss = metric.get()[1]
            metric.reset()
            mod.backward()
            mod.update()
Actually the loss can converge, but it takes too long.
I've suffered from this problem for a long time, on different networks and different datasets.
I didn't have this problem when using Caffe. Is this due to an implementation difference?
Your loss/learning curves look suspiciously smooth, and I believe you could observe the same oscillation in the loss even when the learning rate is set to 0.01, just at a smaller relative scale (i.e. if you 'zoomed in' on the chart you'd see the same pattern). You may have an issue with your data iterator passing the same batch, for example. And your training loop looks wrong, but this could be due to formatting (e.g. mod.update() only being performed every 100 batches isn't correct).
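If you want to rule out the "same batch" hypothesis, a quick sketch-level check is to fingerprint the first batch after two resets and compare; identical digests would point at the iterator not actually reshuffling:
import hashlib

train_iter.reset()
batch = train_iter.next()
print(hashlib.md5(batch.data[0].asnumpy().tobytes()).hexdigest())
train_iter.reset()
batch = train_iter.next()
print(hashlib.md5(batch.data[0].asnumpy().tobytes()).hexdigest())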
You can also observe periodicity in your loss when you're traveling across a valley in the loss surface, bouncing up and down its sides rather than moving down along the valley floor. Choosing a lower learning rate can help fix this, and make sure you are using momentum too.

The learning rate change for the momentum optimizer

When running an existing TensorFlow implementation, I found that the learning rate stays the same between different epochs. The original implementation uses tf.train.MomentumOptimizer and has a decay rate set up.
My understanding of the momentum optimizer is that the learning rate should decrease along with the epochs, so why does the learning rate stay the same throughout my training process? Is it possible that the learning rate also depends on the performance, e.g. that if the performance does not change quickly, the learning rate stays the same? I think I am not very clear about the underlying mechanism of the momentum optimizer, and I am confused that the learning rate stays the same across epochs even though I would expect it to keep decreasing based on the given decay rate.
The optimizer is defined as follows
learning_rate = 0.2
decay_rate = 0.95
self.learning_rate_node = tf.train.exponential_decay(learning_rate=learning_rate,
                                                      global_step=global_step,
                                                      decay_steps=training_iters,
                                                      decay_rate=decay_rate,
                                                      staircase=True)
optimizer = tf.train.MomentumOptimizer(learning_rate=self.learning_rate_node).minimize(self.net.cost,
                                                                                        global_step=global_step)
It is a little bit hard to tell, without looking at the code, whether my answer will be helpful to you or not.
However, here are some insights into how the momentum optimizer works and how the learning rate should decay.
First, the vanilla GradientDescentOptimizer update, which is the most basic:
W^(n+1) = W^(n) - alpha * (gradient of cost wrt W)(W^n)
You are just following the opposite of the gradient.
GradientDescentOptimizer with learning rate decay:
W^(n+1) = W^(n) - alpha(n) * (gradient of cost wrt W)(W^n)
The only thing that changed is the learning rate alpha, which now depends on the step. In TensorFlow the most used form is exponential decay, where after every N steps the learning rate is divided by some constant, e.g. 10.
This change often happens later in the training, so you might need to let a few epochs pass before seeing the decay.
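In code, the staircase exponential decay from the question boils down to the following plain-Python sketch of the formula given in the tf.train.exponential_decay documentation:
def staircase_decay(base_lr, decay_rate, decay_steps, global_step):
    # the lr is constant within each block of decay_steps steps, then drops by decay_rate
    return base_lr * decay_rate ** (global_step // decay_steps)

# With the question's values (learning_rate=0.2, decay_rate=0.95, decay_steps=training_iters),
# the lr stays at 0.2 for the first training_iters global steps and only then drops to 0.19,
# which is why it can look constant across the first epochs.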
The MomentumOptimizer:
Here you have to keep an additional variable: the update you did just before, i.e. you have to store at each time step:
update^(n) = W^(n) - W^(n-1)
Then the momentum-corrected update is:
update^(n+1) = m * update^(n) - alpha * (gradient of cost wrt W)(W^n)
So what you are doing is simple gradient descent, corrected by remembering the immediate past. (There are smarter and more complicated ways of doing this, like Nesterov's momentum.)
MomentumOptimizer with learning rate decay:
update^(n) = W^(n) - W^(n-1)
update^(n+1) = m * update^(n) - alpha(n) * (gradient of cost wrt W)(W^n)
alpha now depends on n too.
So at some point the updates will start slowing down, as in gradient descent with learning rate decay, but the decrease will be affected by the momentum.
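A tiny numeric sketch of these update rules (plain numpy, independent of TensorFlow), minimizing cost = w^2 so that the gradient is 2w:
import numpy as np

m, alpha = 0.9, 0.1
w = np.array([1.0])
update = np.array([0.0])
for n in range(5):
    grad = 2 * w                        # gradient of cost wrt w
    update = m * update - alpha * grad  # update^(n+1) = m*update^(n) - alpha*grad
    w = w + update                      # so update^(n+1) = W^(n+1) - W^(n)
    print(n, float(w[0]))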
For a complete review of these methods and more, there is an excellent website that explains them far better than I can, and Alec Radford's famous visualization, which is worth a thousand words.
The learning rate should not depend on the performance unless that dependence is specified in the decay!
It would help to see the code in question!
EDIT1: Here is a working example that I think answers both questions you asked:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
#Pure SGD
BATCH_SIZE=1
#Batch Gradient Descent
#BATCH_SIZE=1000
starter_learning_rate=0.001
xdata=np.linspace(0.,2*np.pi,1000)[:,np.newaxis]
ydata=np.sin(xdata)+np.random.normal(0.0,0.05,size=1000)[:,np.newaxis]
plt.scatter(xdata,ydata)
x=tf.placeholder(tf.float32,[None,1])
y=tf.placeholder(tf.float32, [None,1])
#We define global_step as a variable initialized at 0
global_step=tf.Variable(0,trainable=False)
w1=tf.Variable(0.05*tf.random_normal((1,100)),tf.float32)
w2=tf.Variable(0.05*tf.random_normal((100,1)),tf.float32)
b1=tf.Variable(np.zeros([100]).astype("float32"),tf.float32)
b2=tf.Variable(np.zeros([1]).astype("float32"),tf.float32)
h1=tf.nn.relu(tf.matmul(x,w1)+b1)
y_model=tf.matmul(h1,w2)+b2
L=tf.reduce_mean(tf.square(y_model-y))
#We want to decrease the learning rate after having seen all the data 5 times
NUM_EPOCHS_PER_DECAY=5
LEARNING_RATE_DECAY_FACTOR=0.1
#Since the mechanism of the decay depends on the number of iterations and not on epochs, we have to connect the number of epochs to the number of iterations
#So if we have batch_size=1 we have to iterate exactly 1000 times to do one epoch, hence 5*1000=5000 iterations before decaying; if the batch_size were 1000, 1 iteration = 1 epoch and we would decay after 5 iterations
num_batches_per_epoch=int(xdata.shape[0]/float(BATCH_SIZE))
decay_steps=int(num_batches_per_epoch*NUM_EPOCHS_PER_DECAY)
decayed_learning_rate=tf.train.exponential_decay(starter_learning_rate,
                                                 global_step,
                                                 decay_steps,
                                                 LEARNING_RATE_DECAY_FACTOR,
                                                 staircase=True)
#So now we have an object that depends on global_step and that will be divided by 10 every decay_steps iterations, i.e. when global_step=N*decay_steps with N a non-zero integer
#We now create a train_step to which we pass the decayed learning rate; each time it is run, global_step is incremented by 1, and we are going to check that this is the case. BE CAREFUL: WE HAVE TO GIVE IT GLOBAL_STEP AS AN ARGUMENT
train_step=tf.train.GradientDescentOptimizer(decayed_learning_rate).minimize(L,global_step=global_step)
sess=tf.Session()
sess.run(tf.initialize_all_variables())
GLOBAL_s=[]
lr_val=[]
COSTS=[]
for i in range(16000):
    #We will do 16000 iterations, so as there is a decay every 5000 iterations we will see 3 decays (at 5000, 10000 and 15000)
    start_data=(i*BATCH_SIZE)%1000
    COSTS.append([sess.run(L, feed_dict={x:xdata,y:ydata})])
    GLOBAL_s.append([sess.run(global_step)])
    lr_val.append([sess.run(decayed_learning_rate)])
    #I see the train_step as implicitly executing sess.run(tf.add(global_step,1))
    sess.run(train_step,feed_dict={x:xdata[start_data:start_data+BATCH_SIZE],y:ydata[start_data:start_data+BATCH_SIZE]})
plt.figure()
plt.subplot(211)
plt.plot(GLOBAL_s,lr_val,"-b")
plt.title("Evolution of learning rate" )
plt.subplot(212)
plt.plot(GLOBAL_s,COSTS,".g")
plt.title("Evolution of cost" )
#notice two things first global_step is actually being incremented and learning rate is actually being decayed
(You can write MomentumOptimizer() instead of GradientDescentOptimizer(), obviously.)
Here are the two plots I get (the evolution of the learning rate and of the cost against global_step):
To sum up: in my mind, when you call train_step, TensorFlow runs tf.add(global_step, 1).