Keras with TensorFlow throws ResourceExhaustedError

For research purposes, I am training a neural network that updates its weights differently depending on the parity of the epoch:
1) If the epoch is even, update the weights of the NN with backpropagation.
2) If the epoch is odd, freeze the network and only update the model with update_weights_with_custom_function().
Here is a simplified part of the code that implements this (note the epochs=1):
for epoch in range(nb_epoch):
    if epoch % 2 == 0:
        model.trainable = True   # Unfreeze the model
    else:
        model.trainable = False  # Freeze the model

    model.compile(optimizer=optim, loss=gaussian_loss, metrics=['accuracy'])

    hist = model.fit(X_train, Y_train,
                     batch_size=batch_size,
                     epochs=1,
                     shuffle=True,
                     verbose=1,
                     callbacks=[tbCallBack, csv_epochs, early_stop],
                     validation_data=(X_val, Y_val))

    if epoch % 2 == 1:
        update_weights_with_custom_function()
Problem: after a few epochs, Keras throws a ResourceExhaustedError, but only with the TensorFlow backend, not with Theano. It seems that looping over compile() keeps creating models without ever releasing them.
So what should I do? I know that K.clear_session() releases memory, but it requires saving the model and reloading it (see), which causes problems for me because load_model() does not work out of the box in my case.
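For reference, the clear_session() route would look roughly like the sketch below. It assumes a hypothetical build_model() helper that reconstructs the architecture, since load_model() does not cope with the custom gaussian_loss out of the box:
import keras.backend as K
from keras.optimizers import Adam

model.save_weights('tmp_weights.h5')   # keep only the weights, not the compiled model
K.clear_session()                      # release the graphs accumulated by the repeated compile() calls
model = build_model()                  # hypothetical helper that rebuilds the same architecture
model.load_weights('tmp_weights.h5')
# recreate the optimizer too, since the old one references the cleared graph
model.compile(optimizer=Adam(), loss=gaussian_loss, metrics=['accuracy'])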
I'm also open to other ways of achieving what I am trying to do (i.e. freezing the model depending on the parity of the epoch).
Summary: Keras with the TensorFlow backend throws a ResourceExhaustedError because I am looping over compile().

As Marcin Możejko pointed out, using eval() does exactly what I was trying to achieve.
I added a custom callback (inspiration was here), which avoids the loop over compile(); a sketch is shown below.
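A minimal sketch of that idea (assuming update_weights_with_custom_function() can be made to take and return a NumPy array; its real signature is not shown above):
import keras.backend as K
from keras.callbacks import Callback

class ParityWeightUpdate(Callback):
    # Sketch only: on odd epochs, read each weight with K.eval() and write the
    # custom update back with K.set_value(), so compile() is called only once.
    def on_epoch_end(self, epoch, logs=None):
        if epoch % 2 == 1:
            for w in self.model.trainable_weights:
                K.set_value(w, update_weights_with_custom_function(K.eval(w)))
Skipping backpropagation on the odd epochs still has to be handled separately, for example by setting the optimizer's learning rate to zero for those epochs with K.set_value(self.model.optimizer.lr, 0.0), which again avoids recompiling.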
The problem is now solved, even though the underlying TensorFlow issue was not addressed directly.

Related

Validation loss reported seems to be wrong, can preprocessing be the reason?

I'm training a ResNet model with Keras, fine-tuned on my own images. During training, TensorBoard constantly reports a validation loss that seems unrelated to the training loss (much higher; see the image below, where the training curve is orange and the validation curve is blue). Furthermore, when training is finished (for example, with final losses reported by TensorBoard of 0.06 and 0.57 respectively), I evaluate the model "manually" and the validation loss turns out to be in the same range as the training loss (e.g. 0.07).
I suspect that preprocessing could be the reason for this strange result. Essentially, the inputs and outputs of the model are created like this:
inp = tf.keras.Input(input_shape)
resnet = tf.keras.applications.ResNet50V2(include_top=False, input_shape=input_shape, input_tensor=inp, pooling="avg")
# Add ResNet50V2 specific preprocessing method into the model.
preprocessed = tf.keras.layers.Lambda(lambda x: tf.keras.applications.resnet_v2.preprocess_input(x))(inp)
out = resnet(preprocessed)
out = tf.keras.layers.Dense(num_outputs, activation=None)(out)
and the training:
model.compile(
    optimizer=tf.keras.optimizers.Adam(lrate),
    loss='mse',
    metrics=[tf.keras.metrics.MeanSquaredError()],
)
model.fit(
    train_dataset,
    epochs=epochs,
    validation_data=val_dataset,
    callbacks=callbacks
)
It's as if the preprocessing does not happen when the validation loss is calculated, but I don't know why.
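One cheap way to test that suspicion (a sketch, assuming the model was built as tf.keras.Model(inp, out), which is not shown above) is to list the layers of the model that fit() and evaluate() actually execute and check that the preprocessing Lambda sits between the Input and the ResNet sub-model:
model = tf.keras.Model(inputs=inp, outputs=out)  # assumed construction
for layer in model.layers:
    # the Lambda layer should appear between the InputLayer and the resnet50v2 sub-model
    print(layer.name, layer.__class__.__name__)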

Approximating a smooth multidimensional function using Keras to an error of 1e-4

I am trying to approximate a function that smoothly maps five inputs to a single probability using Keras, but seem to have hit a limit. A similar problem was posed here (Keras Regression to approximate function (goal: loss < 1e-7)) for a ten-dimensional function and I have found that the architecture proposed there, namely:
model = Sequential()
model.add(Dense(128,input_shape=(5,), activation='tanh'))
model.add(Dense(64,activation='tanh'))
model.add(Dense(1,activation='sigmoid'))
model.compile(optimizer='adam', loss='mae')
gives me my best results, converging to a best loss of around 7e-4 on my validation data when the batch size is 1000. Adding or removing neurons or layers seems to reduce the accuracy. Dropout regularisation also reduces accuracy. I am currently using 1e7 training samples, which took two days to generate (hence the desire to approximate this function). I would like to reduce the MAE by another order of magnitude; does anyone have any suggestions on how to do this?
I recommend using the Keras callbacks ReduceLROnPlateau (documentation is [here][1]) and ModelCheckpoint (documentation is [here][2]). Set the first one to monitor the validation loss; it will reduce the learning rate by a factor (factor) if the loss fails to improve for a fixed number (patience) of consecutive epochs. Have the second one also monitor the validation loss and save the weights of the model with the lowest validation loss to a directory. After training, load those weights and use them to evaluate or predict on the test set. My code implementation is shown below.
checkpoint = tf.keras.callbacks.ModelCheckpoint(filepath=save_loc, monitor='val_loss', verbose=1,
                                                save_best_only=True, save_weights_only=True,
                                                mode='auto', save_freq='epoch', options=None)
lr_adjust = tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=1, verbose=1,
                                                 mode="auto", min_delta=0.00001, cooldown=0, min_lr=0)
callbacks = [checkpoint, lr_adjust]
history = model.fit_generator(train_generator, epochs=EPOCHS,
                              steps_per_epoch=STEPS_PER_EPOCH, validation_data=validation_generator,
                              validation_steps=VALIDATION_STEPS, callbacks=callbacks)
model.load_weights(save_loc)  # load the saved (best) weights
# after this, use the model to evaluate or predict on the test set
# if you are satisfied with the results, you can then save the entire model with
model.save(save_loc)
[1]: https://keras.io/api/callbacks/reduce_lr_on_plateau/
[2]: https://keras.io/api/callbacks/model_checkpoint/

Keras OOM for data validation using GPU

I'm trying to run a deep model on a GPU, and it seems that Keras is running validation against the whole validation dataset in a single batch instead of many batches, which is causing an out-of-memory error:
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM
when allocating tensor with shape[160000,64,64,1] and type double on
/job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[Op:GatherV2]
I did not have this problem when running on the CPU; it only happens when I run on the GPU. My fit code looks like this:
history = model.fit(patches_imgs_train, patches_masks_train, batch_size=8,
                    epochs=10, shuffle=True, verbose=1, validation_split=0.2)
When I delete the validation parameter from the fit method, the code works, but I need the validation.
Since no one is answering this, I can offer you a workaround. You can separate fit() and evaluate() and run the evaluation on CPU.
You'll have to split your data manually to provide the testx and testy to evaluate().
for i in range(10):
    with tf.device('/GPU:0'):
        model.fit(x, y, epochs=1)
    with tf.device('/CPU:0'):
        loss, acc = model.evaluate(testx, testy)
You'll need to handle the accuracy values yourself if you want some form of early stopping.
It isn't perfect, but it will allow you to run much larger networks without OOMs.
Hope it helps.
So I would consider what is happening a bug in the Keras implementation: it looks like it tries to load the whole dataset into memory when splitting it into validation and training sets, and this is not related to the batch size. After trying many ways to work around it, I found the best approach is to split the data with sklearn's train_test_split instead of using the validation_split parameter of the fit method:
from sklearn.model_selection import train_test_split

x_train, x_v, y_train, y_v = train_test_split(x, y, test_size=0.2, train_size=0.8)
history = model.fit(x_train, y_train,
                    batch_size=16,
                    epochs=5,
                    shuffle=True,
                    verbose=2,
                    validation_data=(x_v, y_v))
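If materialising the split arrays is still too heavy, another option (assuming TF 2.x) is to wrap the validation arrays in a batched tf.data.Dataset, so that validation is also streamed batch by batch:
import tensorflow as tf

val_ds = tf.data.Dataset.from_tensor_slices((x_v, y_v)).batch(16)  # validation fed 16 samples at a time
history = model.fit(x_train, y_train,
                    batch_size=16,
                    epochs=5,
                    shuffle=True,
                    verbose=2,
                    validation_data=val_ds)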

Training and testing in convolutional neural networks

I would like to know at what stage the testing dataset is used in CNNs. Is it used after the completion of each batch, after each epoch during training, or only after all the epochs are completed? I am a bit confused about how these two processes run together. Similarly, is the gradient update done after each batch or after each epoch?
model.fit_generator(
    aug.flow(x_train, y_train, batch_size=BATCH_SIZE),
    validation_data=(x_test, y_test),
    steps_per_epoch=len(x_train) // BATCH_SIZE,
    epochs=EPOCHS, verbose=1, callbacks=callbacks)
From fit_generator it is only clear that images are loaded into memory batch by batch.
Keras uses the validation dataset at the end of every epoch (unless you change validation_freq in the fit function). Each epoch, your model trains on the whole training dataset and then evaluates itself on the validation dataset. Gradient updates, on the other hand, are applied after every batch, not once per epoch.
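For illustration, a hypothetical validation_freq=5 would make Keras evaluate on the validation set only after every fifth epoch:
model.fit(x_train, y_train,
          batch_size=BATCH_SIZE,
          epochs=EPOCHS,
          validation_data=(x_test, y_test),
          validation_freq=5,   # validate after epochs 5, 10, 15, ... instead of after every epoch
          callbacks=callbacks)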

TensorBoard Callback in Keras does not respect initial_epoch of fit?

I'm trying to train multiple models in parallel on a single graphics card. To achieve that, I need to resume training of models from saved weights, which is not a problem. The model.fit() method even has a parameter, initial_epoch, that lets me tell the model which epoch the loaded model is on. However, when I pass a TensorBoard callback to fit() in order to monitor the training of the models, all the data on TensorBoard is shown at x=0.
Is there a way to overcome this and adjust the epoch on TensorBoard?
By the way: I'm running Keras 2.0.6 and TensorFlow 1.3.0.
self.callbacks = [TensorBoardCallback(log_dir='./../logs/'+self.model_name, histogram_freq=0, write_graph=True, write_images=False, start_epoch=self.step_num)]
self.model.fit(x=self.data['X_train'], y=self.data['y_train'], batch_size=self.input_params[-1]['batch_size'], epochs=1, validation_data=(self.data['X_test'], self.data['y_test']), verbose=verbose, callbacks=self.callbacks, shuffle=self.hyperparameters['shuffle_data'], initial_epoch=self.step_num)
self.model.save_weights('./weights/%s.hdf5'%(self.model_name))
self.model.load_weights('./weights/%s.hdf5'%(self.model_name))
self.model.fit(x=self.data['X_train'], y=self.data['y_train'], batch_size=self.input_params[-1]['batch_size'], epochs=1, validation_data=(self.data['X_test'], self.data['y_test']), verbose=verbose, callbacks=self.callbacks, shuffle=self.hyperparameters['shuffle_data'], initial_epoch=self.step_num)
self.model.save_weights('./weights/%s.hdf5'%(self.model_name))
The resulting graph on TensorBoard looks like this, which is not what I was hoping for:
Update:
When passing epochs=10 to the first model.fit(), the results for those 10 epochs are displayed in TensorBoard (see picture).
However, when reloading the model and running it (with the same callback attached), the on_epoch_end method of the callback never gets called.
It turns out that the number of epochs I pass to model.fit() to tell it how long to train has to be counted FROM epoch 0, not from initial_epoch. So if initial_epoch=self.step_num, then epochs=self.step_num+10 if I want to train for 10 more epochs.
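In code, the rule looks like this (a sketch; step_num stands for the epoch at which the previous run stopped):
model.fit(x_train, y_train,
          initial_epoch=step_num,    # epoch index to resume from
          epochs=step_num + 10,      # index of the final epoch, i.e. train for 10 more epochs
          callbacks=callbacks)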
Say we have just started fitting our model and our first run uses an epoch count of 30
(please ignore the other parameters; just look at epochs and initial_epoch):
model.fit(train_dataloader, validation_data=test_dataloader, epochs=30,
          steps_per_epoch=len(train_dataloader), callbacks=callback_list)
Now say that after those 30 epochs we want to resume from the 31st epoch (you can see this in TensorBoard) and also change the learning rate of our Adam optimizer (or any optimizer).
What we can do is:
model.optimizer.learning_rate = 0.0005
model.fit(train_dataloader, validation_data=test_dataloader, initial_epoch=30, epochs=55,
          steps_per_epoch=len(train_dataloader), callbacks=callback_list)
=> So here initial_epoch is where we left off training last time, and
epochs is initial_epoch + the number of epochs we want to run in this second fit.