Keras OOM for data validation using GPU - tensorflow

I'm trying to train a deep model on GPU, and it seems Keras runs the validation against the whole validation data set in a single batch instead of validating in many batches, which causes an out-of-memory problem:
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM
when allocating tensor with shape[160000,64,64,1] and type double on
/job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[Op:GatherV2]
I did not have this problem when running on CPU; it only happens when running on GPU. My fit code looks like this:
history = model.fit(patches_imgs_train, patches_masks_train, batch_size=8,
                    epochs=10, shuffle=True, verbose=1, validation_split=0.2)
When I remove the validation_split parameter from the fit method, the code works, but I need the validation.

Since no one is answering this, I can offer you a workaround. You can separate fit() and evaluate() and run the evaluation on CPU.
You'll have to split your data manually to provide the testx and testy to evaluate().
for i in range(10):
    with tf.device('/GPU:0'):
        model.fit(x, y, epochs=1)
    with tf.device('/CPU:0'):
        loss, acc = model.evaluate(testx, testy)
You'll need to track the accuracy values yourself if you want some form of early stopping.
It isn't perfect but it'll allow you to run much larger networks without OOMs.
Hope it helps.
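If you do want early stopping on top of this workaround, a minimal sketch could look like the following; the epoch count, patience value and weights file name are only illustrative, and it assumes the same model, x, y, testx and testy as above.

import tensorflow as tf

best_acc = 0.0
patience, wait = 3, 0                 # illustrative patience

for epoch in range(100):              # illustrative upper bound on epochs
    with tf.device('/GPU:0'):
        model.fit(x, y, epochs=1)
    with tf.device('/CPU:0'):
        loss, acc = model.evaluate(testx, testy)
    if acc > best_acc:
        best_acc, wait = acc, 0
        model.save_weights('best_weights.h5')   # keep the best weights so far
    else:
        wait += 1
        if wait >= patience:
            print('Stopping early at epoch %d, best accuracy %.4f' % (epoch, best_acc))
            break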

So I would consider what is happening a bug in the Keras implementation: it looks like it tries to load the whole data set into memory in order to split it into validation and training sets, and it is not related to the batch size. After trying many ways to work around it, I found the best approach is to split the data with sklearn's train_test_split instead of using the validation_split parameter of the fit method.
from sklearn.model_selection import train_test_split

x_train, x_v, y_train, y_v = train_test_split(x, y, test_size=0.2, train_size=0.8)
history = model.fit(x_train, y_train,
                    batch_size=16,
                    epochs=5,
                    shuffle=True,
                    verbose=2,
                    validation_data=(x_v, y_v))

Related

Approximating a smooth multidimensional function using Keras to an error of 1e-4

I am trying to approximate a function that smoothly maps five inputs to a single probability using Keras, but seem to have hit a limit. A similar problem was posed here (Keras Regression to approximate function (goal: loss < 1e-7)) for a ten-dimensional function and I have found that the architecture proposed there, namely:
model = Sequential()
model.add(Dense(128,input_shape=(5,), activation='tanh'))
model.add(Dense(64,activation='tanh'))
model.add(Dense(1,activation='sigmoid'))
model.compile(optimizer='adam', loss='mae')
gives me my best results, converging to a best loss of around 7e-4 on my validation data when the batch size is 1000. Adding or removing neurons or layers seems to reduce the accuracy. Dropout regularisation also reduces accuracy. I am currently using 1e7 training samples, which took two days to generate (hence the desire to approximate this function). I would like to reduce the mae by another order of magnitude; does anyone have any suggestions on how to do this?
I recommend using the Keras callbacks ReduceLROnPlateau (documentation is [here][1]) and ModelCheckpoint (documentation is [here][2]). For the first, set it to monitor the validation loss; it will reduce the learning rate by a factor (factor) if the loss fails to improve after a fixed number (patience) of consecutive epochs. For the second, also monitor the validation loss and save the weights of the model with the lowest validation loss to a directory. After training, load those weights and use them to evaluate or predict on the test set. My code implementation is shown below.
checkpoint = tf.keras.callbacks.ModelCheckpoint(filepath=save_loc, monitor='val_loss', verbose=1,
                                                save_best_only=True, save_weights_only=True,
                                                mode='auto', save_freq='epoch', options=None)
lr_adjust = tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=1, verbose=1,
                                                 mode='auto', min_delta=0.00001, cooldown=0, min_lr=0)
callbacks = [checkpoint, lr_adjust]
history = model.fit_generator(train_generator, epochs=EPOCHS,
                              steps_per_epoch=STEPS_PER_EPOCH,
                              validation_data=validation_generator,
                              validation_steps=VALIDATION_STEPS,
                              callbacks=callbacks)
model.load_weights(save_loc) # load the saved weights
# after this use the model to evaluate or predict on the test set.
# if you are satisfied with the results you can then save the entire model with
model.save(save_loc)
[1]: https://keras.io/api/callbacks/reduce_lr_on_plateau/
[2]: https://keras.io/api/callbacks/model_checkpoint/

Early stopping based on AUC

I am fairly new to ML and am currently implementing a simple 3D CNN in python using tensorflow and keras. I want to optimize based on the AUC and would also like to use early stopping/save the best network in terms of AUC score. I have been using tensorflow's AUC function for this as shown below, and it works well for the training. However, the hdf5 file is not saved (despite the checkpoint save_best_only=True) and hence I cannot get the best weights for the evaluation.
Here are the relevant lines of code:
model.compile(loss='binary_crossentropy',
              optimizer=keras.optimizers.Adam(lr=lr),
              metrics=[tf.keras.metrics.AUC()])
model.load_weights(path_weights)
filepath = mypath
check = tf.keras.callbacks.ModelCheckpoint(filepath, monitor=tf.keras.metrics.AUC(),
                                           save_best_only=True, mode='auto')
earlyStopping = tf.keras.callbacks.EarlyStopping(monitor=tf.keras.metrics.AUC(),
                                                 patience=hyperparams['pat'], mode='auto')
history = model.fit(X_trn, y_trn,
                    batch_size=bs,
                    epochs=n_epochs,
                    verbose=1,
                    callbacks=[check, earlyStopping],
                    validation_data=(X_val, y_val),
                    shuffle=True)
Interestingly, if I only change monitor='val_loss' in the early stopping and checkpoint (not the metrics in model.compile), the hdf5 file is saved, but it obviously gives the best result in terms of validation loss. I have also tried using mode='max', but the problem is the same.
I would very much appreciate your advice, or any other constructive ideas on how to work around this problem.
It turns out that even if you add a non-keyword metric, you still need to use its handle to refer to it when you want to monitor it. In your case you can do this:
auc = tf.keras.metrics.AUC()  # instantiate it here to have a shorter handle
model.compile(loss='binary_crossentropy',
              optimizer=keras.optimizers.Adam(lr=lr),
              metrics=[auc])
...
check = tf.keras.callbacks.ModelCheckpoint(filepath,
                                           monitor='auc',  # even use the generated handle for monitoring the training AUC
                                           save_best_only=True,
                                           mode='max')  # determine better models according to "max" AUC
If you want to monitor the validation AUC (which makes more sense), simply add val_ at the beginning of the handle:
check = tf.keras.callbacks.ModelCheckpoint(filepath,
                                           monitor='val_auc',  # validation AUC
                                           save_best_only=True,
                                           mode='max')
Another problem is that your ModelCheckpoint is saving the weights based on the minimum AUC instead of the maximum, which is what you want.
This can be changed by setting mode='max'.
What does mode='auto' do?
This setting essentially checks whether the monitor argument contains 'acc' and, if so, sets the mode to max. In any other case it uses mode='min', which is what is happening in your case.
You can confirm this here
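Paraphrased as code, the auto-mode check works roughly like the sketch below (a simplified rendering of the logic described above, not a verbatim copy of the Keras source):

import numpy as np

def resolve_auto_mode(monitor):
    # mode='auto': only names containing 'acc' (or starting with 'fmeasure')
    # are treated as quantities to maximize; everything else is minimized.
    if 'acc' in monitor or monitor.startswith('fmeasure'):
        return np.greater   # 'max' behaviour
    return np.less          # 'min' behaviour

resolve_auto_mode('val_acc')   # np.greater -> best model has the highest value
resolve_auto_mode('val_auc')   # np.less    -> best model has the lowest value (the problem here)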
The answer posted by Djib2011 should solve your problem. I just wanted to address the use of early stopping. Typically this is used to stop training when overfitting starts to cause the loss to increase. I think it is more effective to address the overfitting issue directly, which should let you achieve a lower loss. You did not list your model, so it is not clear how to address overfitting, but some simple guidelines are as follows. If you have several dense hidden layers at the top of the model, delete most of them and just keep the final top dense layer. The more complex the model, the more prone it is to overfitting. If that leads to lower training accuracy, keep the layers but add dropout layers. You might also try using regularization in the hidden dense layers. I also find it beneficial to use the ReduceLROnPlateau callback. Set it up to monitor the AUC and reduce the learning rate if it fails to improve.
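A minimal sketch of that ReduceLROnPlateau setup, assuming the model was compiled with the auc handle from the answer above (so the monitored quantity is named 'val_auc'); the factor and patience values are only illustrative:

reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor='val_auc',   # validation AUC, matching the metric handle
    mode='max',          # AUC should be maximized
    factor=0.5,          # halve the learning rate
    patience=2,          # after 2 epochs without improvement
    verbose=1)

history = model.fit(X_trn, y_trn,
                    batch_size=bs,
                    epochs=n_epochs,
                    verbose=1,
                    callbacks=[check, earlyStopping, reduce_lr],
                    validation_data=(X_val, y_val),
                    shuffle=True)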

On Keras, what's the difference between running a model.fit on n epochs vs. running n times model.fit on 1 epoch?

When training a model, I often use the epochs= argument of the model.fit method, like the following:
model.fit(x, y, epochs=100, ...)
But I saw some kernels on Kaggle using the for loop approach, like this:
for i in range(0, 100):
    model.fit(x, y, epochs=1, ...)
Intuitively, I would say they are different because model.fit can perform some kind of parameter initialization, but I might be wrong.
Can anyone point out the difference?
Thanks
Not quite: the weights are initialized by the specified weight initializer (or the default one if none is specified) when the model is built, which happens at the latest on the first call to model.fit(). Subsequent fit() calls resume training from the current weights rather than reinitializing them, so the two approaches train the same model. The remaining differences are practical ones: anything tied to the epoch counter (callbacks, learning-rate schedules, the returned History object) behaves differently when each call only sees one epoch. Unless you need to run custom code between epochs, it's much easier to just pass the desired number of epochs than to loop.
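A quick way to convince yourself of this, using a toy model and random data (all names here are illustrative):

import numpy as np
import tensorflow as tf

x = np.random.rand(256, 5).astype('float32')
y = np.random.rand(256, 1).astype('float32')

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(5,)),
    tf.keras.layers.Dense(1)])
model.compile(optimizer='adam', loss='mse')

h1 = model.fit(x, y, epochs=1, verbose=0)
h2 = model.fit(x, y, epochs=1, verbose=0)   # resumes training; no reinitialization

# The loss of the second call continues from where the first one left off,
# rather than jumping back to the initial untrained value.
print(h1.history['loss'], h2.history['loss'])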

Keras with tensorflow throws ResourceExhaustedError

For research purposes, I am training a neural network that is updating its weights differently depending on the parity of the epoch:
1) If the epoch is even, change the weights of the NN with backpropagation
2) If the epoch is odd, only update the model with update_weights_with_custom_function(), therefore freezing the network.
Here is a simplified part of the code that implements this (notice the epochs=1):
for epoch in range(nb_epoch):
    if epoch % 2 == 0:
        model.trainable = True   # Unfreeze the model
    else:
        model.trainable = False  # Freeze the model

    model.compile(optimizer=optim, loss=gaussian_loss, metrics=['accuracy'])

    hist = model.fit(X_train, Y_train,
                     batch_size=batch_size,
                     epochs=1,
                     shuffle=True,
                     verbose=1,
                     callbacks=[tbCallBack, csv_epochs, early_stop],
                     validation_data=(X_val, Y_val))

    if epoch % 2 == 1:
        update_weights_with_custom_function()
Problem: after a few epochs, Keras throws a ResourceExhaustedError, but only with TensorFlow, not with Theano. It seems that looping over compile() keeps creating models without releasing them.
Therefore, what should I do? I know that K.clear_session() releases memory, but it requires saving the model and reloading it (see), which gives me some issues, as load_model() in my case does not work out of the box.
I'm also open to other ways to do what I am trying to achieve (i.e. freezing a NN model depending on the parity of the epoch).
Summary: Keras with the TensorFlow backend is throwing a ResourceExhaustedError because I am looping over compile().
As Marcin Możejko pointed out, using eval() does exactly what I was trying to achieve.
I added a custom callback (inspiration was here), which avoids looping over compile().
The problem is now solved, even though the TensorFlow issue was not addressed directly.
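For reference, a minimal sketch of such a callback (not necessarily what the OP ended up with): it approximates "freezing" on odd epochs by zeroing the learning rate instead of recompiling, and assumes update_weights_with_custom_function() works as in the question; the base_lr value is only illustrative.

import tensorflow as tf

class ParityTrainingCallback(tf.keras.callbacks.Callback):
    """Alternate between normal backprop (even epochs) and the custom
    update (odd epochs) without recompiling the model in a loop."""

    def __init__(self, base_lr):
        super().__init__()
        self.base_lr = base_lr

    def on_epoch_begin(self, epoch, logs=None):
        # Even epochs: train normally. Odd epochs: approximate freezing
        # by setting the learning rate to zero instead of recompiling.
        lr = self.base_lr if epoch % 2 == 0 else 0.0
        tf.keras.backend.set_value(self.model.optimizer.lr, lr)

    def on_epoch_end(self, epoch, logs=None):
        if epoch % 2 == 1:
            update_weights_with_custom_function()  # the OP's custom update

model.compile(optimizer=optim, loss=gaussian_loss, metrics=['accuracy'])  # compiled once
hist = model.fit(X_train, Y_train,
                 batch_size=batch_size,
                 epochs=nb_epoch,
                 shuffle=True,
                 verbose=1,
                 callbacks=[tbCallBack, csv_epochs, early_stop,
                            ParityTrainingCallback(base_lr=1e-3)],
                 validation_data=(X_val, Y_val))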

Using batch size with TensorFlow Validation Monitor

I'm using tf.contrib.learn.Estimator to train a CNN with 20+ layers. I'm using a GTX 1080 (8 GB) for training. My dataset is not that large, but my GPU runs out of memory with a batch size greater than 32, so I'm using a batch size of 16 for training and for evaluating the classifier (the GPU runs out of memory during evaluation as well if a batch_size is not specified).
# Configure the accuracy metric for evaluation
metrics = {
    "accuracy":
        learn.MetricSpec(
            metric_fn=tf.metrics.accuracy, prediction_key="classes"),
}

# Evaluate the model and print results
eval_results = classifier.evaluate(
    x=X_test, y=y_test, metrics=metrics, batch_size=16)
Now the problem is that after every 100 steps, I only get the training loss printed on screen. I want to print the validation loss and accuracy as well, so I'm using a ValidationMonitor:
validation_monitor = tf.contrib.learn.monitors.ValidationMonitor(
    X_test,
    y_test,
    every_n_steps=50)

# Train the model
classifier.fit(
    x=X_train,
    y=y_train,
    batch_size=8,
    steps=20000,
    monitors=[validation_monitor])
Actual problem: My code crashes (out of memory) when I use the ValidationMonitor. I think the problem could be solved if I could specify a batch size here as well, but I can't figure out how to do that. I want the ValidationMonitor to evaluate my validation data in batches, like I do manually after training using classifier.evaluate; is there a way to do that?
The ValidationMonitor's constructor accepts a batch_size arg that should do the trick.
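Based on the code in the question, that would look roughly like the sketch below (I have not re-tested this against the now-deprecated tf.contrib.learn API, so treat it as a sketch):

validation_monitor = tf.contrib.learn.monitors.ValidationMonitor(
    X_test,
    y_test,
    batch_size=16,       # evaluate the validation data in batches of 16
    every_n_steps=50)

classifier.fit(
    x=X_train,
    y=y_train,
    batch_size=8,
    steps=20000,
    monitors=[validation_monitor])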
You need to pass config=tf.contrib.learn.RunConfig(save_checkpoints_secs=save_checkpoints_secs) in your model definition. save_checkpoints_secs can be changed to save_checkpoints_steps, but you cannot use both.