Multiple calls to fit when using LearningRateSchedule - tensorflow

If I compile a Keras model with an optimizer that uses a LearningRateSchedule and call model.fit() for one epoch several times, will the learning rate schedule restart on every call, or will it preserve its state?
model = create_keras_model()
lr_scheduler = create_lr_scheduler()
optimizer = Adam(learning_rate=lr_scheduler)
for i in range(10):
    model.fit(dataset, epochs=1)
Thanks.
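A plain-Python sketch (not TensorFlow code) of why the answer depends on where the step counter lives. To my knowledge, tf.keras evaluates a LearningRateSchedule at the optimizer's own iterations counter, and fit() does not reset that counter, so the schedule should keep advancing across calls as long as the same optimizer object is reused. ExponentialDecaySchedule, ToyOptimizer and fit_one_epoch below are hypothetical stand-ins for illustration, not library classes.

```python
class ExponentialDecaySchedule:
    """Hypothetical step-based schedule: lr = initial_lr * decay_rate ** (step / decay_steps)."""

    def __init__(self, initial_lr, decay_steps, decay_rate):
        self.initial_lr = initial_lr
        self.decay_steps = decay_steps
        self.decay_rate = decay_rate

    def __call__(self, step):
        return self.initial_lr * self.decay_rate ** (step / self.decay_steps)


class ToyOptimizer:
    """Hypothetical optimizer that owns the step counter, like optimizer.iterations."""

    def __init__(self, schedule):
        self.schedule = schedule
        self.iterations = 0  # lives on the optimizer, not on the training call

    def current_lr(self):
        return self.schedule(self.iterations)

    def apply_update(self):
        # A real optimizer would update weights here; only the counter is modelled.
        self.iterations += 1


def fit_one_epoch(optimizer, steps_per_epoch):
    """Stand-in for model.fit(dataset, epochs=1): runs the steps, resets nothing."""
    for _ in range(steps_per_epoch):
        optimizer.apply_update()


optimizer = ToyOptimizer(ExponentialDecaySchedule(0.1, decay_steps=10, decay_rate=0.5))
lrs = []
for _ in range(3):
    lrs.append(optimizer.current_lr())  # learning rate at the start of each call
    fit_one_epoch(optimizer, steps_per_epoch=10)

print(lrs, optimizer.iterations)
```

In a real model, printing optimizer.iterations between the fit() calls is a quick way to check that the counter keeps growing.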

Related

Keras OOM for data validation using GPU

I'm trying to run a deep model on a GPU, and it seems Keras is running validation against the whole validation data set in one batch instead of validating in many batches, which causes an out-of-memory error:
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM
when allocating tensor with shape[160000,64,64,1] and type double on
/job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[Op:GatherV2]
I did not have this problem when running on CPU; it only happens when I run on GPU. My fit code looks like this:
history = model.fit(patches_imgs_train, patches_masks_train, batch_size=8,
                    epochs=10, shuffle=True, verbose=1, validation_split=0.2)
When I delete the validation parameter from the fit method the code works, but I need the validation.
Since no one is answering this, I can offer you a workaround. You can separate fit() and evaluate() and run the evaluation on CPU.
You'll have to split your data manually to provide the testx and testy to evaluate().
for i in range(10):
    with tf.device('/GPU:0'):
        model.fit(x, y, epochs=1)
    with tf.device('/CPU:0'):
        loss, acc = model.evaluate(testx, testy)
You'll need to handle the accuracy values yourself if you want early stopping.
It isn't perfect but it'll allow you to run much larger networks without OOMs.
Hope it helps.
So I could consider what is happening a bug in the Keras implementation: it looks like it tries to load the whole data set into memory in order to split it into validation and training sets, and this is unrelated to the batch size. After trying many ways around it, I found the best approach is to split the data with sklearn's train_test_split instead of splitting it inside the fit method with the validation_split parameter.
from sklearn.model_selection import train_test_split

x_train, x_v, y_train, y_v = train_test_split(x, y, test_size=0.2, train_size=0.8)
history = model.fit(x_train, y_train,
                    batch_size=16,
                    epochs=5,
                    shuffle=True,
                    verbose=2,
                    validation_data=(x_v, y_v))
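One thing to keep in mind when switching between the two approaches: as far as I know, validation_split holds out the last fraction of the samples before any shuffling, while train_test_split shuffles by default, so the two validation sets are generally not the same samples. A minimal plain-Python sketch of that difference on indices only; tail_split and shuffled_split are hypothetical helpers, not Keras or sklearn functions, and the rounding may differ from the libraries in edge cases.

```python
import random


def tail_split(n_samples, validation_split):
    """Hold out the last fraction of the indices, without shuffling."""
    n_val = int(n_samples * validation_split)
    indices = list(range(n_samples))
    return indices[:n_samples - n_val], indices[n_samples - n_val:]


def shuffled_split(n_samples, test_size, seed=0):
    """Shuffle the indices first, then hold out a fraction of them."""
    n_val = int(n_samples * test_size)
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = tail_split(10, 0.2)
print(val_idx)  # the last two indices

s_train_idx, s_val_idx = shuffled_split(10, 0.2)
print(sorted(s_val_idx))  # two indices picked by the shuffle
```

If the data is ordered (for example, patches grouped by source image), the tail split and the shuffled split can give quite different validation scores.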

can't reproduce model.fit with GradientTape

I've been trying to investigate (e.g. by checking weights, gradients and activations during training) why SGD with a 0.001 learning rate worked in training while Adam fails to do so. (Please see my previous post, "Why is my loss (binary cross entropy) converging on ~0.6? (Task: Natural Language Inference)".)
Note: I'm using the same model from my previous post here as well.
Using tf.keras, I trained the neural network with model.fit():
model.compile(optimizer=SGD(learning_rate=0.001),
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(x=ds,
          epochs=80,
          validation_data=ds_val)
This resulted in the epoch loss graphed below: within the first epoch it reached a train loss of 0.46, and it ultimately ended with a train_loss of 0.1241 and a val_loss of 0.2849.
I would have used tf.keras.callbacks.TensorBoard(histogram_freq=1) to train the network with both SGD(0.001) and Adam to investigate, but it throws an InvalidArgumentError on Variable:0, which I can't decipher. So I tried writing a custom training loop using GradientTape and plotting the values.
Using tf.GradientTape(), I tried to reproduce the results with the exact same model and dataset. However, the loss decreases incredibly slowly, reaching a train loss of 0.676 after 15 epochs (see graph below). Is there something wrong with my implementation? (Code below.)
from typing import Dict

import tensorflow as tf
from tensorflow.keras.losses import BinaryCrossentropy, Loss
from tensorflow.keras.optimizers import SGD
from tqdm import tqdm

@tf.function
def compute_grads(train_batch: Dict[str, tf.Tensor], target_batch: tf.Tensor,
                  loss_fn: Loss, model: tf.keras.Model):
    with tf.GradientTape(persistent=False) as tape:
        # forward pass
        outputs = model(train_batch)
        # calculate loss
        loss = loss_fn(y_true=target_batch, y_pred=outputs)
    # calculate gradients for each param
    grads = tape.gradient(loss, model.trainable_variables)
    return grads, loss
BATCH_SIZE = 8
EPOCHS = 15
bce = BinaryCrossentropy()
optimizer = SGD(learning_rate=0.001)
for epoch in tqdm(range(EPOCHS), desc='epoch'):
    # - accumulators
    epoch_loss = 0.0
    for (i, (train_batch, target_dict)) in tqdm(enumerate(ds_train.shuffle(1024).batch(BATCH_SIZE)), desc='step'):
        (grads, loss) = compute_grads(train_batch, target_dict['target'], bce, model)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        epoch_loss += loss
    avg_epoch_loss = epoch_loss/(i+1)
    tensorboard_scalar(writer, name='epoch_loss', data=avg_epoch_loss, step=epoch) # custom helper function
    print("Epoch {}: epoch_loss = {}".format(epoch, avg_epoch_loss))
Thanks in advance!
Check how you shuffle your dataset: the problem may come from shuffling with the tf.data.Dataset method, which only shuffles within one buffer of elements at a time. Keras Model.fit may have yielded better results because it probably adds another shuffle.
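To illustrate the buffer behaviour, here is a minimal plain-Python simulation of a buffer-based shuffle. It is a sketch of the general idea, assuming a fill-then-sample scheme; buffered_shuffle is a hypothetical helper, not the tf.data implementation. With a buffer smaller than the dataset, an element can only appear a limited distance ahead of its original position, so data that starts out sorted stays roughly sorted.

```python
import random


def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in an order produced by sampling from a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        # Buffer is full: emit a random element and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: emit what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


n, buffer_size = 20, 5
out = list(buffered_shuffle(range(n), buffer_size))
print(out)

# With a buffer of one element, nothing is shuffled at all.
unshuffled = list(buffered_shuffle(range(5), 1))
print(unshuffled)
```

A buffer at least as large as the dataset gives a full shuffle; anything smaller only mixes nearby elements.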
Adding a shuffle with numpy.random.shuffle may improve the training performance. From this reference.
An example of applying it when generating the dataset:
numpy_data = np.hstack([index_rows.reshape(-1, 1), index_cols.reshape(-1, 1), index_data.reshape(-1, 1)])
np.random.shuffle(numpy_data)
indexes = np.array(numpy_data[:, :2], dtype=np.uint32)
labels = np.array(numpy_data[:, 2].reshape(-1, 1), dtype=np.float32)
train_ds = data.Dataset.from_tensor_slices(
    (indexes, labels)
).shuffle(100000).batch(batch_size, drop_remainder=True)
If this does not work, you may need to use Dataset .repeat(epochs_number) and .shuffle(..., reshuffle_each_iteration=True):
train_ds = data.Dataset.from_tensor_slices(
    (np.hstack([index_rows.reshape(-1, 1), index_cols.reshape(-1, 1)]), index_data)
).shuffle(100000, reshuffle_each_iteration=True
).batch(batch_size, drop_remainder=True
).repeat(epochs_number)
for ix, (examples, labels) in train_ds.enumerate():
    train_step(examples, labels)
    current_epoch = ix // (len(index_data) // batch_size)
This workaround is neither beautiful nor natural, but for the moment you can use it to reshuffle every epoch. It's a known issue and will be fixed; in the future you can use for epoch in range(epochs_number) instead of .repeat().
The solution provided here may also help a lot. You might want to check it out.
If this is not the case, you may want to speed up the TF 2.0 GradientTape loop. This can be the solution:
TensorFlow 2.0 introduces the concept of functions, which translate eager code into graph code.
The usage is pretty straightforward. The only change needed is that all relevant functions (like compute_loss and apply_gradients) have to be annotated with @tf.function.

Best practices in Tensorflow 2.0(Training step)

In TensorFlow 2.0 you don't have to worry about the training phase (batch size, number of epochs, etc.), because everything can be specified in the fit call: model.fit(X_train, Y_train, batch_size=64, epochs=100).
But I have seen the following code style:
optimizer = tf.keras.optimizers.Adam(0.001)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

@tf.function
def train_step(inputs, labels):
    with tf.GradientTape() as tape:
        predictions = model(inputs, training=True)
        regularization_loss = tf.math.add_n(model.losses)
        pred_loss = loss_fn(labels, predictions)
        total_loss = pred_loss + regularization_loss
    gradients = tape.gradient(total_loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

for epoch in range(NUM_EPOCHS):
    for inputs, labels in train_data:
        train_step(inputs, labels)
    print("Finished epoch", epoch)
So here you can see more detailed code, where you define your training procedure manually with for loops.
I have the following question: what is the best practice in TensorFlow 2.0? I haven't found any complete tutorial.
Use what is best for your needs.
Both methods are documented in Tensorflow tutorials.
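If you don't need anything special (no extra losses, unusual metrics or intricate gradient computation), just use model.fit() or model.fit_generator(). This is totally OK and makes your life easier. (Note that fit_generator() is deprecated in later TF 2.x releases, where fit() accepts generators directly.)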
A custom training loop might come in handy when you have complicated models with non-trivial loss/gradients calculation.
Up to now, two applications I tried were easier with this:
Training a GAN's generator and discriminator simultaneously without having to do the generation step twice. (It's complicated because you have a loss function that applies to different y_true values, and each case should update only a part of the model.) The other option would require having a few separate models, each with its own trainable=True/False configuration, and training them in separate phases.
Training inputs (good for style transfer models). Alternatively, create a custom layer that takes dummy inputs and outputs its own trainable weights. But it gets complicated to compile several loss functions for each of the outputs of the base and style networks.
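To make concrete what a custom loop spells out by hand, here is a library-free sketch that mirrors the structure of the train_step above (forward pass, prediction loss plus regularization loss, gradient, update) on a one-parameter model y = w * x. Everything in it (the model, the data, the l2 and lr values) is a made-up illustration, and the gradient is derived by hand rather than recorded by a tape.

```python
def train_step(w, inputs, labels, lr, l2):
    """One manual gradient-descent step for the model y = w * x."""
    n = len(inputs)
    # Forward pass.
    predictions = [w * x for x in inputs]
    # Prediction loss (mean squared error) plus an L2 regularization term.
    pred_loss = sum((p - y) ** 2 for p, y in zip(predictions, labels)) / n
    regularization_loss = l2 * w * w
    total_loss = pred_loss + regularization_loss
    # Hand-derived gradient of total_loss with respect to w.
    grad = sum(2 * (p - y) * x for p, y, x in zip(predictions, labels, inputs)) / n
    grad += 2 * l2 * w
    # Plain SGD update.
    return w - lr * grad, total_loss


inputs = [1.0, 2.0, 3.0, 4.0]
labels = [2.0, 4.0, 6.0, 8.0]  # generated by y = 2 * x

w = 0.0
losses = []
for epoch in range(50):
    w, loss = train_step(w, inputs, labels, lr=0.05, l2=0.0)
    losses.append(loss)

print(w, losses[0], losses[-1])
```

Because every step is explicit, this is the place where one could update only part of the parameters or combine several losses, which is the kind of control the two applications above rely on.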

A Tensorflow training agnostic to Eager and Graph modes

I spend some of my time coding novel (I wish) RNN cells in Tensorflow.
To prototype, I use eager mode (easier to debug).
In order to train, I migrate the code to a graph (runs faster).
I am looking for wrapper code or an example that can run the forward pass and training in a way that is as agnostic as possible to the mode I run it in, eager or graph. I have in mind a set of functions/classes into which the particular neural network/optimizer/data can be inserted, and that could run in both modes with minimal changes between the two. In addition, it would of course be good if it were compatible with many types of NN/optimizer/data instances.
I am quite sure that many had this idea.
I wonder if something like this is feasible given the current eager/graph integration in TF.
Yes. I have been wondering the same. In the Tensorflow documentation you can see:
The same code written for eager execution will also build a graph during graph execution. Do this by simply running the same code in a new Python session where eager execution is not enabled.
But this is hard to achieve, mostly because working with graphs means dealing with placeholders, which cannot be used in eager mode. I tried to get rid of placeholders by using object-oriented layers and the Dataset API. This is the closest I could get to totally compatible code:
m = 128  # num_examples
n = 5  # num features
epochs = 2
batch_size = 32
steps_per_epoch = m // batch_size

dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random_uniform([m, n], dtype=tf.float32),
     tf.random_uniform([m, 1], dtype=tf.float32)))
dataset = dataset.repeat(epochs)
dataset = dataset.batch(batch_size)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_dim=n),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1)
])
def train_eagerly(model, dataset):
    optimizer = tf.train.AdamOptimizer()
    iterator = dataset.make_one_shot_iterator()
    print('Training eagerly...')
    for epoch in range(epochs):
        print('Epoch', epoch)
        # stateful_metrics expects a list of metric names
        progbar = tf.keras.utils.Progbar(target=steps_per_epoch, stateful_metrics=['loss'])
        for step in range(steps_per_epoch):
            with tf.GradientTape() as tape:
                features, labels = iterator.get_next()
                predictions = model(features, training=True)
                loss_value = tf.losses.mean_squared_error(labels, predictions)
            grads = tape.gradient(loss_value, model.variables)
            optimizer.apply_gradients(zip(grads, model.variables))
            progbar.add(1, values=[('loss', loss_value.numpy())])
def train_graph(model, dataset):
    optimizer = tf.train.AdamOptimizer()
    iterator = dataset.make_initializable_iterator()
    print('Training graph...')
    with tf.Session() as sess:
        sess.run(iterator.initializer)
        sess.run(tf.global_variables_initializer())
        for epoch in range(epochs):
            print('Epoch', epoch)
            progbar = tf.keras.utils.Progbar(target=steps_per_epoch, stateful_metrics='loss')
            for step in range(steps_per_epoch):
                with tf.GradientTape() as tape:
                    features, labels = sess.run(iterator.get_next())
                    predictions = model(features, training=True)
                    loss_value = tf.losses.mean_squared_error(labels, predictions)
                grads = tape.gradient(loss_value, model.variables)
                optimizer.apply_gradients(zip(grads, model.variables))
                progbar.add(1, values=[('loss', loss_value.eval())])
As you can see, the main difference is that I use a one_shot_iterator during Eager training (of course, during graph training, I have to run operations within a session).
I tried to do the same using optimizer.minimize instead of applying the gradients myself, but I could not come up with code that worked in both eager and graph modes.
Also, I'm sure this becomes much harder to do with not so simple models, like the one you are working with.

Saving the state of the AdaGrad algorithm in Tensorflow

I am trying to train a word2vec model, and want to use the embeddings for another application. As there might be extra data later, and my computer is slow when training, I would like my script to stop and resume training later.
To do this, I created a saver:
saver = tf.train.Saver({"embeddings": embeddings,"embeddings_softmax_weights":softmax_weights,"embeddings_softmax_biases":softmax_biases})
I save the embeddings, and softmax weights and biases so I can resume training later. (I assume that this is the correct way, but please correct me if I'm wrong).
Unfortunately when resuming training with this script the average loss seems to go up again.
My idea is that this can be attributed to the AdaGradOptimizer I'm using. Initially the outer product matrix will probably be set to all zeros, whereas after my training it will be filled in (leading to a lower learning rate).
Is there a way to save the optimizer state to resume learning later?
While TensorFlow seems to complain when you attempt to serialize an optimizer object directly (e.g. via tf.add_to_collection("optimizers", optimizer) and a subsequent call to tf.train.Saver().save()), you can save and restore the training update operation which is derived from the optimizer:
# init
if not load_model:
    optimizer = tf.train.AdamOptimizer(1e-4)
    train_step = optimizer.minimize(loss)
    tf.add_to_collection("train_step", train_step)
else:
    saver = tf.train.import_meta_graph(modelfile + '.meta')
    saver.restore(sess, tf.train.latest_checkpoint('./'))
    train_step = tf.get_collection("train_step")[0]

# training loop
while training:
    if iteration % save_interval == 0:
        saver = tf.train.Saver()
        save_path = saver.save(sess, filepath)
I do not know of a way to get or set the parameters specific to an existing optimizer, so I do not have a direct way of verifying that the optimizer's internal state was restored, but training resumes with loss and accuracy comparable to when the snapshot was created.
I would also recommend using the parameterless call to Saver() so that state variables not specifically mentioned will still be saved, although this might not be strictly necessary.
You may also wish to save the iteration or epoch number for later restoring, as detailed in this example:
http://www.seaandsailor.com/tensorflow-checkpointing.html
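To illustrate why the optimizer state matters when resuming, here is a minimal plain-Python sketch of a scalar AdaGrad update with a JSON checkpoint. It is not TensorFlow code; the AdaGrad class, grad_fn and the checkpoint layout are assumptions made for the example. It compares the first step after resuming with the accumulator restored against the first step with a fresh accumulator.

```python
import json
import math
import os
import tempfile


class AdaGrad:
    """Toy scalar AdaGrad: the step shrinks as squared gradients accumulate."""

    def __init__(self, lr=0.1, eps=1e-8):
        self.lr = lr
        self.eps = eps
        self.accumulator = 0.0  # running sum of squared gradients
        self.step = 0

    def update(self, param, grad):
        self.accumulator += grad * grad
        self.step += 1
        return param - self.lr * grad / (math.sqrt(self.accumulator) + self.eps)

    def state(self):
        return {"accumulator": self.accumulator, "step": self.step}

    def load_state(self, state):
        self.accumulator = state["accumulator"]
        self.step = state["step"]


def grad_fn(w):
    """Gradient of the toy loss (w - 3) ** 2."""
    return 2.0 * (w - 3.0)


# Train for a while, then checkpoint the parameter AND the optimizer state.
opt = AdaGrad()
w = 0.0
for _ in range(100):
    w = opt.update(w, grad_fn(w))

path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
with open(path, "w") as f:
    json.dump({"w": w, "optimizer": opt.state()}, f)

with open(path) as f:
    ckpt = json.load(f)

resumed = AdaGrad()
resumed.load_state(ckpt["optimizer"])  # accumulator and step restored
fresh = AdaGrad()                      # only the parameter restored

g = grad_fn(ckpt["w"])
step_resumed = abs(resumed.update(ckpt["w"], g) - ckpt["w"])
step_fresh = abs(fresh.update(ckpt["w"], g) - ckpt["w"])
print(step_resumed, step_fresh)
```

The fresh optimizer takes a much larger first step than the restored one, which is consistent with the loss going back up when only the weights are saved. In the TensorFlow setup above, the parameterless Saver() is what lets such state variables be included in the checkpoint.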