Using batch size with TensorFlow Validation Monitor - tensorflow

I'm using tf.contrib.learn.Estimator to train a CNN having 20+ layers. I'm using GTX 1080 (8 GB) for training. My dataset is not so large but my GPU runs out of memory with a batch size greater than 32. So I'm using a batch size of 16 for training and Evaluating the classifier (GPU runs out of memory while evaluation as well if a batch_size is not specified).
# Configure the accuracy metric for evaluation
metrics = {
"accuracy":
learn.MetricSpec(
metric_fn=tf.metrics.accuracy, prediction_key="classes"),
}
# Evaluate the model and print results
eval_results = classifier.evaluate(
x=X_test, y=y_test, metrics=metrics, batch_size=16)
Now the problem is that after every 100 steps, I only get training loss printed on screen. I want to print validation loss and accuracy as well, So I'm using a ValidationMonitor
validation_monitor = tf.contrib.learn.monitors.ValidationMonitor(
X_test,
y_test,
every_n_steps=50)
# Train the model
classifier.fit(
x=X_train,
y=y_train,
batch_size=8,
steps=20000,
monitors=[validation_monitor]
ActualProblem: My code crashes (Out of Memory) when I use ValidationMonitor, I think the problem might be solved if I could specify a batch size here as well and I can't figure out how to do that. I want ValidationMonitor to evaluate my validation data in batches, like I do it manually after training using classifier.evaluate, is there a way to do that?

The ValidationMonitor's constructor accepts a batch_size arg that should do the trick.

You need to add config=tf.contrib.learn.RunConfig( save_checkpoints_secs=save_checkpoints_secs) in your model definition. The save_checkpoints_secs can be changed to save_checkpoints_steps, but not both.

Related

training model CNN KERAS

hello everyone i am trying to train a model using cnn and keras but the training don't finish and i got this warning and it stops training , i don't know why and i didn't understand where the problem is can anyone gives me a advice or what i should change in the code
def myModel():
no_Of_Filters=60
size_of_Filter=(5,5) # THIS IS THE KERNEL THAT MOVE AROUND THE IMAGE TO GET THE FEATURES.
# THIS WOULD REMOVE 2 PIXELS FROM EACH BORDER WHEN USING 32 32 IMAGE
size_of_Filter2=(3,3)
size_of_pool=(2,2) # SCALE DOWN ALL FEATURE MAP TO GERNALIZE MORE, TO REDUCE OVERFITTING
no_Of_Nodes = 500 # NO. OF NODES IN HIDDEN LAYERS
model= Sequential()
model.add((Conv2D(no_Of_Filters,size_of_Filter,input_shape=(imageDimesions[0],imageDimesions[1],1),activation='relu'))) # ADDING MORE CONVOLUTION LAYERS = LESS FEATURES BUT CAN CAUSE ACCURACY TO INCREASE
model.add((Conv2D(no_Of_Filters, size_of_Filter, activation='relu')))
model.add(MaxPooling2D(pool_size=size_of_pool)) # DOES NOT EFFECT THE DEPTH/NO OF FILTERS
model.add((Conv2D(no_Of_Filters//2, size_of_Filter2,activation='relu')))
model.add((Conv2D(no_Of_Filters // 2, size_of_Filter2, activation='relu')))
model.add(MaxPooling2D(pool_size=size_of_pool))
model.add(Dropout(0.5))
model.add(Flatten())
model.add(Dense(no_Of_Nodes,activation='relu'))
model.add(Dropout(0.5)) # INPUTS NODES TO DROP WITH EACH UPDATE 1 ALL 0 NONE
model.add(Dense(noOfClasses,activation='softmax')) # OUTPUT LAYER
# COMPILE MODEL
model.compile(Adam(lr=0.001),loss='categorical_crossentropy',metrics=['accuracy'])
return model
############################### TRAIN
model = myModel()
print(model.summary())
history=model.fit_generator(dataGen.flow(X_train,y_train,batch_size=batch_size_val),steps_per_epoch=steps_per_epoch_val,epochs=epochs_val,validation_data=(X_validation,y_validation),shuffle=1)
WARNING:tensorflow:Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least `steps_per_epoch * epochs` batches (in this case, 20000 batches). You may need to use the repeat() function when building your dataset.
While using generators, you can either run the model without the step_per_epoch parameter and let the model figure out how many steps are there to cover an epoch.
history=model.fit_generator(dataGen.flow(X_train,y_train,batch_size=batch_size_val),
epochs=epochs_val,
validation_data=(X_validation,y_validation),
shuffle=1)
OR
you'll have to calculate steps_per_epoch and use it while training as follows;
history=model.fit_generator(dataGen.flow(X_train,y_train,batch_size=batch_size_val),
steps_per_epoch=(data_samples/batch_size)
epochs=epochs_val,
validation_data=(X_validation,y_validation),
shuffle=1)
Let us know if the issue still persists. Thanks!

Use of tf.GradientTape() exhausts all the gpu memory, without it it doesn't matter

I'm working on Convolution Tasnet, model size I made is about 5.05 million variables.
I want to train this using custom training loops, and the problem is,
for i, (input_batch, target_batch) in enumerate(train_ds): # each shape is (64, 32000, 1)
with tf.GradientTape() as tape:
predicted_batch = cv_tasnet(input_batch, training=True) # model name
loss = calculate_sisnr(predicted_batch, target_batch) # some custom loss
trainable_vars = cv_tasnet.trainable_variables
gradients = tape.gradient(loss, trainable_vars)
cv_tasnet.optimizer.apply_gradients(zip(gradients, trainable_vars))
This part exhausts all the gpu memory (24GB available)..
When I tried without tf.GradientTape() as tape,
for i, (input_batch, target_batch) in enumerate(train_ds):
predicted_batch = cv_tasnet(input_batch, training=True)
loss = calculate_sisnr(predicted_batch, target_batch)
This uses a reasonable amount of gpu memory(about 5~6GB).
I tried the same format of tf.GradientTape() as tape for the basic mnist data, then it works without problem.
So would the size matter? but the same error arises when I lowered BATCH_SIZE to 32 or smaller.
Why the 1st code block exhausts all the gpu memory?
Of course, I put
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
try:
# Currently, memory growth needs to be the same across GPUs
for gpu in gpus:
tf.config.experimental.set_memory_growth(gpu, True)
except RuntimeError as e:
# Memory growth must be set before GPUs have been initialized
print(e)
this code at the very first cell.
Gradient tape triggers automatic differentiation which requires tracking gradients on all your weights and activations. Autodiff requires multiple more memory. This is normal. You'll have to manually tune your batch size until you find one that works, then tune your LR. Usually, the tune just means guess & check or grid search. (I am working on a product to do all of that for you but I'm not here to plug it).

Approximating a smooth multidimensional function using Keras to an error of 1e-4

I am trying to approximate a function that smoothly maps five inputs to a single probability using Keras, but seem to have hit a limit. A similar problem was posed here (Keras Regression to approximate function (goal: loss < 1e-7)) for a ten-dimensional function and I have found that the architecture proposed there, namely:
model = Sequential()
model.add(Dense(128,input_shape=(5,), activation='tanh'))
model.add(Dense(64,activation='tanh'))
model.add(Dense(1,activation='sigmoid'))
model.compile(optimizer='adam', loss='mae')
gives me my best results, converging to a best loss of around 7e-4 on my validation data when the batch size is 1000. Adding or removing more neurons or layers seems to reduce the accuracy. Dropout regularisation also reduces accuracy. I am currently using 1e7 training samples, which took two days to generate (hence the desire to approximate this function). I would like to reduce the mae by another order of magnitude, does anyone have any suggestions how to do this?
I recommend use utilize the keras callbacks ReduceLROnPlateau, documentation is [here][1] and ModelCheckpoint, documentation is [here.][2]. For the first, set it to monitory validation loss and it will reduce the learning rate by a factor(factor) if the loss fails to reduce after a fixed number (patience) of consecutive epochs. For the second also monitor validation loss and save the weights for the model with the lowest validation loss to a directory. After training load the weights and use them to evaluate or predict on the test set. My code implementation is shown below.
checkpoint=tf.keras.callbacks.ModelCheckpoint(filepath=save_loc, monitor='val_loss', verbose=1, save_best_only=True,
save_weights_only=True, mode='auto', save_freq='epoch', options=None)
lr_adjust=tf.keras.callbacks.ReduceLROnPlateau( monitor="val_loss", factor=0.5, patience=1, verbose=1, mode="auto",
min_delta=0.00001, cooldown=0, min_lr=0)
callbacks=[checkpoint, lr_adjust]
history = model.fit_generator( train_generator, epochs=EPOCHS,
steps_per_epoch=STEPS_PER_EPOCH,validation_data=validation_generator,
validation_steps=VALIDATION_STEPS, callbacks=callbacks)
model.load_weights(save_loc) # load the saved weights
# after this use the model to evaluate or predict on the test set.
# if you are satisfied with the results you can then save the entire model with
model.save(save_loc)
[1]: https://keras.io/api/callbacks/reduce_lr_on_plateau/
[2]: https://keras.io/api/callbacks/model_checkpoint/

batch_size in tf model.fit() vs. batch_size in tf.data.Dataset

I have a large dataset that can fit in host memory. However, when I use tf.keras to train the model, it yields GPU out-of-memory problem. Then I look into tf.data.Dataset and want to use its batch() method to batch the training dataset so that it can execute the model.fit() in GPU. According to its documentation, an example is as follows:
train_dataset = tf.data.Dataset.from_tensor_slices((train_examples, train_labels))
test_dataset = tf.data.Dataset.from_tensor_slices((test_examples, test_labels))
BATCH_SIZE = 64
SHUFFLE_BUFFER_SIZE = 100
train_dataset = train_dataset.shuffle(SHUFFLE_BUFFER_SIZE).batch(BATCH_SIZE)
test_dataset = test_dataset.batch(BATCH_SIZE)
Is the BATCH_SIZE in dataset.from_tensor_slices().batch() the same as the batch_size in the tf.keras modelt.fit()?
How should I choose BATCH_SIZE so that GPU has sufficient data to run efficiently and yet its memory is not overflown?
You do not need to pass the batch_size parameter in model.fit() in this case. It will automatically use the BATCH_SIZE that you use in tf.data.Dataset().batch().
As for your other question : the batch size hyperparameter indeed needs to be carefully tuned. On the other hand, if you see OOM errors, you should decrease it until you do not get OOM (normally (but not necessarily) in this manner 32 --> 16 --> 8 ...). In fact you can try non-power of two batch sizes for the decrease purposes.
In your case I would start with a batch_size of 2 an increase it gradually (3-4-5-6...).
You do not need to provide the batch_size parameter if you use the tf.data.Dataset().batch() method.
In fact, even the official documentation states this:
batch_size : Integer or None. Number of samples per gradient update.
If unspecified, batch_size will default to 32. Do not specify the
batch_size if your data is in the form of datasets, generators, or
keras.utils.Sequence instances (since they generate batches).

Unknown number of steps - Training convolution neural network at Google Colab Pro

I am trying to run (training) my CNN at Google Colab Pro, when I run my code, all is allright, but It does not know the number of steps, so an infinite loop is created.
Mounted at /content/drive
2.2.0-rc3
Found 10018 images belonging to 2 classes.
Found 1336 images belonging to 2 classes.
WARNING:tensorflow:`period` argument is deprecated. Please use `save_freq` to specify the frequency in number of batches seen.
Epoch 1/300
8/Unknown - 364s 45s/step - loss: 54.9278 - accuracy: 0.5410
I am using ImageDataGenerator() for loadings images. How can I fix it?
An iterator does not store anything, it generates the data dynamically. When you are using a dataset or dataset iterator, you must provide steps_per_epoch. The length of an iterator is unknown until you iterate through it. You could explicitly pass len(datafiles) into the .fit function. So, You need to provide steps_per_epoch as shown below.
model.fit_generator(
train_data_gen,
steps_per_epoch=total_train // batch_size,
epochs=epochs,
validation_data=val_data_gen,
validation_steps=total_val // batch_size
)
More details are mentioned here
steps_per_epoch: Integer or None. Total number of steps (batches of
samples) before declaring one epoch finished and starting the next
epoch. When training with input tensors such as TensorFlow data
tensors, the default None is equal to the number of samples in your
dataset divided by the batch size, or 1 if that cannot be determined.
If x is a tf.data dataset, and 'steps_per_epoch' is None, the epoch
will run until the input dataset is exhausted. This argument is not
supported with array inputs.
I notice you are using binary classification. One more thing to remember when you use ImageDataGenerator is to provide class_mode as shown below. Otherwise, there will be a bug (in keras) or 50% accuracy (in tf.keras).
train_data_gen = train_image_generator.flow_from_directory(batch_size=batch_size,
directory=train_dir,
shuffle=True,
target_size=(IMG_HEIGHT, IMG_WIDTH),class_mode='binary') #