How to use OneCycleLR? - optimization

I want to train on CIFAR-10, suppose for 200 epochs.
This is my optimizer:
optimizer = optim.Adam([x for x in model.parameters() if x.requires_grad], lr=0.001)
I want to use OneCycleLR as scheduler. Now, according to the documentation, these are the parameters of OneCycleLR:
torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr, total_steps=None, epochs=None, steps_per_epoch=None, pct_start=0.3, anneal_strategy='cos', cycle_momentum=True, base_momentum=0.85, max_momentum=0.95, div_factor=25.0, final_div_factor=10000.0, three_phase=False, last_epoch=- 1, verbose=False)
I have seen that the most used are max_lr, epochs and steps_per_epoch. The documentation says this:
max_lr (float or list) – Upper learning rate boundaries in the cycle for each parameter group.
epochs (int) – The number of epochs to train for. This is used along with steps_per_epoch in order to infer the total number of steps in the cycle if a value for total_steps is not provided. Default: None
steps_per_epoch (int) – The number of steps per epoch to train for. This is used along with epochs in order to infer the total number of steps in the cycle if a value for total_steps is not provided. Default: None
About steps_per_epoch, I have seen in many GitHub repos that people use steps_per_epoch=len(data_loader), so if I have a batch size of 128, then this parameter is equal to 128.
However, I do not understand what the other 2 parameters are. If I want to train for 200 epochs, should I set epochs=200? Or is this a parameter that runs the scheduler for that many epochs only and then restarts it? For example, if I write epochs=10 inside the scheduler but train for 200 epochs in total, is that like 20 complete cycles of the scheduler?
Then for max_lr I have seen people using a value greater than the lr of the optimizer, and other people using a smaller value. I think that max_lr must be greater than the lr (otherwise why is it called max :smiley: ?)
However, if I print the learning rate epoch by epoch, it takes strange values. For example, in this setting:
optimizer = optim.Adam([x for x in model.parameters() if x.requires_grad], lr=0.001)
scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr = 0.01, epochs=200, steps_per_epoch=128)
And this is the learning rate:
Epoch 1: TrL=1.7557, TrA=0.3846, VL=1.4136, VA=0.4917, TeL=1.4266, TeA=0.4852, LR=0.0004,
Epoch 2: TrL=1.3414, TrA=0.5123, VL=1.2347, VA=0.5615, TeL=1.2231, TeA=0.5614, LR=0.0004,
...
Epoch 118: TrL=0.0972, TrA=0.9655, VL=0.8445, VA=0.8161, TeL=0.8764, TeA=0.8081, LR=0.0005,
Epoch 119: TrL=0.0939, TrA=0.9677, VL=0.8443, VA=0.8166, TeL=0.9094, TeA=0.8128, LR=0.0005,
So lr is increasing

The documentation says that you should give either total_steps alone, or both epochs and steps_per_epoch, as arguments. The simple relation between them is total_steps = epochs * steps_per_epoch.
total_steps is the total number of steps in the cycle. The "OneCycle" in the name means there is only one cycle over the whole training run.
max_lr is the maximum learning rate of OneCycleLR. To be exact, the learning rate will increase from max_lr / div_factor to max_lr in the first pct_start * total_steps steps, and then decrease smoothly to the minimum learning rate, which is (max_lr / div_factor) / final_div_factor.
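For the setup in the question, a minimal sketch might look like the following; model, train_loader and criterion are hypothetical stand-ins for the question's training objects, and note that steps_per_epoch is the number of batches per epoch, i.e. len(train_loader), not the batch size:

import torch
import torch.optim as optim

# `model`, `train_loader` (a DataLoader over CIFAR-10) and `criterion` are assumed to exist.
optimizer = optim.Adam([p for p in model.parameters() if p.requires_grad], lr=0.001)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=0.01,
    epochs=200,
    steps_per_epoch=len(train_loader),  # number of batches per epoch, not the batch size
)

for epoch in range(200):
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        scheduler.step()  # OneCycleLR is stepped once per batch, not once per epoch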
Edit: For those who are not familiar with lr_scheduler, you can plot the learning rate curve, e.g.
import torch
import matplotlib.pyplot as plt

EPOCHS = 10
BATCHES = 10
steps = []
lrs = []
model = torch.nn.Linear(1, 1)  # any model works here; a tiny stand-in just for plotting
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)  # wrapped optimizer
scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=0.9, total_steps=EPOCHS * BATCHES)

for epoch in range(EPOCHS):
    for batch in range(BATCHES):
        optimizer.step()   # in real training this follows the backward pass
        scheduler.step()   # step the scheduler once per batch
        lrs.append(scheduler.get_last_lr()[0])
        steps.append(epoch * BATCHES + batch)

plt.figure()
plt.plot(steps, lrs, label='OneCycle')
plt.legend()
plt.show()

Related

What to do if I don't want to waste samples by flooring steps_per_epoch in model.fit?

I have 536 training samples and would like to run through all of them in each epoch. The batch size is 32 and the number of epochs is 50. Here is the code and its error:
results = model.fit(train_X, train_y, batch_size = 32, epochs = 50, validation_data=(val_X, val_y), callbacks=callbacks)
The dataset you passed contains 837 batches, but you passed epochs=50 and steps_per_epoch=17, which is a total of 850 steps. We cannot draw that many steps from this dataset. We suggest to set steps_per_epoch=16.
Total number of samples / batch size = steps per epoch = 536 / 32 = 16.75. model.fit would work if I set steps_per_epoch = 16. Doesn't this mean I'm discarding 24 samples (0.75 * 32) per epoch?
If so, how can I avoid discarding these samples? One way would be adjusting the batch size so that there is no remainder when dividing the number of samples by it.
If there are other ways, please enlighten me.
If you do not want to discard any training samples, it's better not to pass steps_per_epoch once you have defined the batch_size, because the model itself can calculate steps_per_epoch during training from the defined batch_size and the provided training samples.
Please go through the attached gist, where I have tried to explain these terms in more detail.
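As a minimal sketch of that (with a hypothetical tiny model and random stand-in data of the question's shape): omitting steps_per_epoch lets Keras run ceil(536 / 32) = 17 steps per epoch, with a final partial batch of 24 samples, so nothing is discarded.

import numpy as np
import tensorflow as tf

# Hypothetical stand-ins: 536 samples, 10 features, binary labels.
train_X = np.random.rand(536, 10).astype("float32")
train_y = np.random.randint(0, 2, size=(536, 1))

model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation="sigmoid")])
model.compile(optimizer="adam", loss="binary_crossentropy")

# No steps_per_epoch: Keras derives ceil(536 / 32) = 17 steps per epoch;
# the last batch simply contains the remaining 24 samples.
history = model.fit(train_X, train_y, batch_size=32, epochs=2)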

Pytorch schedule learning rate

I am trying to re-implement a paper, which suggests adjusting the learning rate as below:
The learning rate is decreased by a factor of the regression value with patience epochs 10 on the change value of 0.0001.
Should I use torch.optim.lr_scheduler.ReduceLROnPlateau()?
I am not sure what value I should pass to each parameter.
Does the change value in the statement refer to the parameter threshold?
Does the factor in the statement refer to the parameter factor?
torch.optim.lr_scheduler.ReduceLROnPlateau is indeed what you are looking for. I summarized all of the important stuff for you.
mode=min: lr will be reduced when the quantity monitored has stopped decreasing
factor: factor by which the learning rate will be reduced
patience: number of epochs with no improvement after which learning rate will be reduced
threshold: threshold for measuring the new optimum, to only focus on significant changes (the change value). Say we have threshold=0.0001 with threshold_mode='abs': if the loss is 18.0 on epoch n and 17.9999 on epoch n+1, that drop is not larger than the threshold, so it does not count as an improvement; after patience such epochs in a row, the current learning rate is multiplied by factor.
criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.1, patience=10,
    threshold=0.0001, threshold_mode='abs')

for epoch in range(20):
    # training loop stuff
    loss = criterion(...)
    scheduler.step(loss)
You can check more details in the documentation: https://pytorch.org/docs/stable/optim.html#torch.optim.lr_scheduler.ReduceLROnPlateau
Pytorch has many ways to let you reduce the learning rate. It is quite well explained here:
https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
@Antonino DiMaggio explained ReduceLROnPlateau quite well. I just want to complement that answer to reply to the comment of @Yan-JenHuang:
Is it possible to decrease the learning rate by subtracting a constant value instead of multiplying by a factor?
First of all, you should be very careful to avoid negative values of lr! Second, subtracting a value from the learning rate is not common practice. But in any case...
You first have to make a custom lr scheduler (I modified the code of LambdaLR https://pytorch.org/docs/stable/_modules/torch/optim/lr_scheduler.html#LambdaLR):
import warnings
from torch.optim.lr_scheduler import _LRScheduler

class SubtractLR(_LRScheduler):
    def __init__(self, optimizer, lr_lambda, last_epoch=-1, min_lr=1e-6):
        self.optimizer = optimizer
        self.min_lr = min_lr  # min learning rate > 0
        if not isinstance(lr_lambda, list) and not isinstance(lr_lambda, tuple):
            self.lr_lambdas = [lr_lambda] * len(optimizer.param_groups)
        else:
            if len(lr_lambda) != len(optimizer.param_groups):
                raise ValueError("Expected {} lr_lambdas, but got {}".format(
                    len(optimizer.param_groups), len(lr_lambda)))
            self.lr_lambdas = list(lr_lambda)
        self.last_epoch = last_epoch
        super(SubtractLR, self).__init__(optimizer, last_epoch)

    def get_lr(self):
        if not self._get_lr_called_within_step:
            warnings.warn("To get the last learning rate computed by the scheduler, "
                          "please use `get_last_lr()`.")
        # reduces the learning rate by subtracting the lambda's value, never going below min_lr
        return [max(base_lr - lmbda(self.last_epoch), self.min_lr)
                for lmbda, base_lr in zip(self.lr_lambdas, self.base_lrs)]
Then you can use it in your training:
lambda1 = lambda epoch: 1e-4  # constant to subtract from the lr
scheduler = SubtractLR(optimizer, lr_lambda=[lambda1])
for epoch in range(100):
    train(...)
    validate(...)
    scheduler.step()
You can also make the subtracted value grow with the epoch:
lambda1 = lambda epoch: epoch * 1e-6  # increases the value subtracted from the lr proportionally to the epoch
scheduler = SubtractLR(optimizer, lr_lambda=[lambda1])
for epoch in range(100):
    train(...)
    validate(...)
    scheduler.step()
You can also modify the code of ReduceLROnPlateau to subtract the learning rate instead of multiplying it. You should change the line new_lr = max(old_lr * self.factor, self.min_lrs[i]) to something like new_lr = max(old_lr - self.factor, self.min_lrs[i]). You can take a look at the code yourself: https://pytorch.org/docs/stable/_modules/torch/optim/lr_scheduler.html#ReduceLROnPlateau
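For illustration only, here is a sketch of that modification as a subclass; it overrides _reduce_lr, a private method of ReduceLROnPlateau whose internals may change between PyTorch versions:

import torch
from torch.optim.lr_scheduler import ReduceLROnPlateau

class SubtractLROnPlateau(ReduceLROnPlateau):
    def _reduce_lr(self, epoch):
        for i, param_group in enumerate(self.optimizer.param_groups):
            old_lr = float(param_group['lr'])
            new_lr = max(old_lr - self.factor, self.min_lrs[i])  # subtract instead of multiply
            if old_lr - new_lr > self.eps:
                param_group['lr'] = new_lr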
As a supplement to the above answer about ReduceLROnPlateau: threshold also has modes ('rel' | 'abs') in the PyTorch lr scheduler (at least for versions >= 1.6), and the default is 'rel', which means that if your loss is 18, it has to change by at least 18 * 0.0001 = 0.0018 to be recognized as an improvement. So watch out for the threshold mode as well.
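To make the two modes concrete, here is a small illustration based on the is_better logic for mode='min', using the example values from this thread:

# The loss must drop below these cutoffs to count as an improvement (mode='min').
best = 18.0
threshold = 1e-4

rel_cutoff = best * (1 - threshold)  # threshold_mode='rel': 17.9982, i.e. a drop of at least 0.0018
abs_cutoff = best - threshold        # threshold_mode='abs': 17.9999, i.e. a drop of at least 0.0001

print(rel_cutoff, abs_cutoff)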

Running training the discriminator with more examples

As I understand it, one of the differences between a regular GAN and a WGAN is that we train the discriminator/critic on more examples in each epoch. If in the regular GAN we have one batch for both modules per epoch, in a WGAN we will have 5 batches (or more) for the discriminator and one for the generator.
So basically we have another inner loop for the discriminator :
real_images_labels = np.ones((BATCH_SIZE, 1))
fake_images_labels = -real_images_labels
for epoch in range(epochs):
    for batch in range(NUM_BACHES):
        for critic_iter in range(n_critic):
            random_batches_idx = np.random.randint(0, NUM_BACHES)  # choose a random batch from the dataset
            imgs_data = dataset_list[random_batches_idx]
            c_loss_real = critic.train_on_batch(imgs_data, real_images_labels)  # update the weights after 1 batch
            noise = tf.random.normal([imgs_data.shape[0], noise_dim])  # generate noise data
            generated_images = generator(noise, training=True)
            c_loss_fake = critic.train_on_batch(generated_images, fake_images_labels)  # update the weights after 1 batch
        imgs_data = dataset_list[batch]
        noise = tf.random.normal([imgs_data.shape[0], noise_dim])  # generate noise data
        gen_loss_batch = gen_loss_batch + gan.train_on_batch(noise, real_images_labels)
The training is taking me a lot of time, about 3 minutes per epoch. The idea I had to decrease the training time is: instead of running a forward pass for each batch n_critic times, I can increase the batch_size for the discriminator and run a single forward pass with a bigger batch_size.
I am seeking feedback: does it sound reasonable?
(I didn't paste my entire code, it was just a part of it).
Yes, it does sound reasonable. Increasing the batch_size during training typically decreases the training time, at the cost of using more memory and possibly lower accuracy (lower generalization ability).
Having said this, you should always do trial and error with regard to batching, as extreme values may or may not increase the training time.
For further discussion you can refer to this question
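For reference, a hedged sketch of the bigger-batch idea from the question, with hypothetical stand-in models and data (the real code would use the question's critic, generator and dataset): one larger train_on_batch call replaces the n_critic inner iterations. Note that this is a single weight update instead of n_critic updates, which changes the training dynamics; that is part of the accuracy trade-off mentioned above.

import numpy as np
import tensorflow as tf

# Hypothetical stand-ins for the question's objects and sizes.
n_critic, BATCH_SIZE, noise_dim = 5, 32, 100
dataset_array = np.random.rand(1000, noise_dim).astype("float32")  # flattened "images"

critic = tf.keras.Sequential([tf.keras.layers.Dense(1)])
critic.compile(optimizer="rmsprop", loss="mse")
generator = tf.keras.Sequential([tf.keras.layers.Dense(noise_dim)])

# One critic update on real data and one on fake data, each with a batch that is
# n_critic times larger, instead of n_critic separate smaller updates.
big_batch = n_critic * BATCH_SIZE
idx = np.random.randint(0, dataset_array.shape[0], size=big_batch)
imgs_data = dataset_array[idx]

real_labels = np.ones((big_batch, 1))
fake_labels = -real_labels

c_loss_real = critic.train_on_batch(imgs_data, real_labels)
noise = tf.random.normal([big_batch, noise_dim])
generated_images = generator(noise, training=True)
c_loss_fake = critic.train_on_batch(generated_images, fake_labels)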

What does steps mean in the train method of tf.estimator.Estimator?

I'm completely confused about the meaning of epochs and steps. I also read the question "What is the difference between steps and epochs in TensorFlow?", but I'm not sure about the answer. Consider this part of the code:
EVAL_EVERY_N_STEPS = 100
MAX_STEPS = 10000

nn = tf.estimator.Estimator(
    model_fn=model_fn,
    model_dir=args.model_path,
    params={"learning_rate": 0.001},
    config=tf.estimator.RunConfig())

for _ in range(MAX_STEPS // EVAL_EVERY_N_STEPS):
    print(_)
    nn.train(input_fn=train_input_fn,
             hooks=[train_qinit_hook, step_cnt_hook],
             steps=EVAL_EVERY_N_STEPS)
    if args.run_validation:
        results_val = nn.evaluate(input_fn=val_input_fn,
                                  hooks=[val_qinit_hook,
                                         val_summary_hook],
                                  steps=EVAL_STEPS)
        print('Step = {}; val loss = {:.5f};'.format(
            results_val['global_step'],
            results_val['loss']))
Also, the number of training samples is 400. I take MAX_STEPS // EVAL_EVERY_N_STEPS to be the number of epochs (or iterations); indeed, that comes to 100. What does steps mean in nn.train?
In Deep Learning:
an epoch means one pass over the entire training set.
a step or iteration corresponds to one forward pass and one backward pass.
If your dataset is not divided and passed as is to your algorithm, each step corresponds to one epoch, but usually, a training set is divided into N mini-batches. Then, each step goes through one batch and you need N steps to complete a full epoch.
Here, if batch_size == 4 then 100 steps are indeed equal to one epoch.
epochs = batch_size * steps // n_training_samples
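Plugging in the numbers from this question (with the batch size of 4 assumed above):

batch_size = 4
n_training_samples = 400
MAX_STEPS = 10000

steps_per_epoch = n_training_samples // batch_size     # 100 steps make one full pass over the data
epochs = batch_size * MAX_STEPS // n_training_samples  # 4 * 10000 // 400 = 100 epochs in total

print(steps_per_epoch, epochs)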

TensorFlow DataSet shuffle - data shuffling only starting from second epoch

I am using the TensorFlow Dataset for the input data pipeline. I am wondering how to run training without data shuffling in the first epoch and start shuffling the data from the second epoch.
The graph is usually built before iterative training starts, and during training it does not seem straightforward to change the Dataset's shuffling behavior, since to me that looks like changing the graph.
Any idea?
Thanks,
Harry
The buffer_size argument to Dataset.shuffle() can be a computed tf.Tensor, so you can use the following code that uses Dataset.range(NUM_EPOCHS).flat_map(...) to transform a sequence of epoch numbers to the (shuffled or otherwise) elements of a per_epoch_dataset:
NUM_EPOCHS = ...          # The total number of epochs.
BUFFER_SIZE = ...         # The shuffle buffer size to use from the second epoch on.
per_epoch_dataset = ...   # A `Dataset` representing the elements of a single epoch.

def shuffle_after_first_epoch(epoch):
    # Set `epoch_buffer_size` to 1 (i.e. no shuffling) in the 0th epoch,
    # and `BUFFER_SIZE` thereafter.
    epoch_buffer_size = tf.cond(tf.equal(epoch, 0),
                                lambda: tf.constant(1, tf.int64),
                                lambda: tf.constant(BUFFER_SIZE, tf.int64))
    return per_epoch_dataset.shuffle(epoch_buffer_size)

dataset = tf.data.Dataset.range(NUM_EPOCHS).flat_map(shuffle_after_first_epoch)
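As an illustration only, here is a small runnable instantiation with made-up values (a ten-element range standing in for one epoch of data): in the 0th epoch the elements come out in their original order, and from the second epoch on they are shuffled.

import tensorflow as tf

NUM_EPOCHS = 3
BUFFER_SIZE = 10
per_epoch_dataset = tf.data.Dataset.range(10)  # stand-in for one epoch of data

def shuffle_after_first_epoch(epoch):
    epoch_buffer_size = tf.cond(tf.equal(epoch, 0),
                                lambda: tf.constant(1, tf.int64),
                                lambda: tf.constant(BUFFER_SIZE, tf.int64))
    return per_epoch_dataset.shuffle(epoch_buffer_size)

dataset = tf.data.Dataset.range(NUM_EPOCHS).flat_map(shuffle_after_first_epoch)
for element in dataset:  # eager iteration (TF 2.x)
    print(int(element.numpy()))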