Keras fit freezes at the end of the first epoch - tensorflow

I am currently experimenting with fine-tuning the VGG16 network using Keras. I started tweaking a little bit with the cats and dogs dataset.
However, with the current configuration, training seems to hang at the end of the first epoch:
from keras import applications
from keras.preprocessing.image import ImageDataGenerator
from keras import optimizers
from keras.models import Sequential, Model
from keras.layers import Dropout, Flatten, Dense

img_width, img_height = 224, 224
train_data_dir = 'data/train'
validation_data_dir = 'data/validation'
nb_train_samples = 2000
nb_validation_samples = 800
epochs = 50
batch_size = 20

model = applications.VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
print('Model loaded.')

top_model = Sequential()
top_model.add(Flatten(input_shape=model.output_shape[1:]))
top_model.add(Dense(256, activation='relu', name='newlayer'))
top_model.add(Dropout(0.5))
top_model.add(Dense(2, activation='softmax'))

model = Model(inputs=model.input, outputs=top_model(model.output))

for layer in model.layers[:19]:
    layer.trainable = False

model.compile(loss='categorical_crossentropy',
              optimizer=optimizers.Adam(lr=0.0001),
              metrics=['accuracy'])

train_datagen = ImageDataGenerator(
    rescale=1. / 255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True)

test_datagen = ImageDataGenerator(rescale=1. / 255)

train_generator = train_datagen.flow_from_directory(
    train_data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    shuffle=True,
    class_mode='categorical')

validation_generator = test_datagen.flow_from_directory(
    validation_data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='categorical')

model.fit_generator(
    train_generator,
    steps_per_epoch=nb_train_samples // batch_size,
    epochs=epochs,
    validation_data=validation_generator,
    validation_steps=nb_validation_samples)
Last output:
Epoch 1/50
99/100 [============================>.] - ETA: 0s - loss: 0.5174 - acc: 0.7581
Am I missing something?

Shuffle
In my case, I was calling fit(...) with shuffle='batch'. Removing this parameter from the arguments resolved the problem. (I assume it's a TensorFlow bug but I didn't dig into it.)
Validation
Another consideration is that validation is performed at the end of the epoch. If your validation data isn't batched, and particularly if you are padding your data, you could be performing validation on data much larger than your training batch size, padded to the maximum sample length of your validation data. This can be a problem of out-of-memory proportions.
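Applied to the question's code, note that validation_steps counts batches, not samples, so passing nb_validation_samples directly makes Keras run 800 validation batches (16000 images) per epoch, which can look like a freeze. A minimal corrected call, using the question's own variables:

model.fit_generator(
    train_generator,
    steps_per_epoch=nb_train_samples // batch_size,
    epochs=epochs,
    validation_data=validation_generator,
    validation_steps=nb_validation_samples // batch_size)  # 800 // 20 = 40 batches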

I faced this problem in Colab, which provides only limited memory (up to 12 GB) in the cloud, and that creates many issues when solving a problem. That's why only 300 images were used for training and testing. When the images were preprocessed at 600x600 and the batch size was set to 128, the Keras model froze during epoch 1. No error was shown; the actual problem was the limited runtime memory, which Colab could not handle because it gives only 12 GB.
The problem was solved by changing the batch size to 4 and reducing the image dimensions to 300x300; with 600x600 it still did not work.
In short, the recommended solution is to keep making the image dimensions and batch_size smaller, and re-running, until there is no runtime error.
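In code terms, the fix amounts to something like the following sketch (the generator and directory names are assumptions; the numbers are the ones from this answer):

train_generator = train_datagen.flow_from_directory(
    train_dir,
    target_size=(300, 300),  # reduced from 600x600
    batch_size=4,            # reduced from 128
    class_mode='categorical')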

I faced the same issue.
It happens because the model is being evaluated on the validation dataset, and this usually takes a lot of time. Try reducing the validation dataset, or just wait for some time; that worked for me. It only seems stuck, but it is actually running on the validation dataset.

If you are using from tensorflow.keras.preprocessing.image import ImageDataGenerator, try changing it to from keras.preprocessing.image import ImageDataGenerator, or vice versa. That worked for me. It's said that you should never mix keras and tensorflow.keras imports.

I tried everything posted here, but nothing worked for me. I found the solution by simply putting the validation set into a numpy array, like this:
numpy.array(validation_x)
Super simple, and it works like a charm. I hope this helps someone.
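For illustration, a hedged sketch of that conversion (validation_x, validation_y, train_x, and train_y are hypothetical variables, not from the question's code):

import numpy as np

validation_x = np.array(validation_x)  # plain Python lists -> ndarrays
validation_y = np.array(validation_y)
model.fit(train_x, train_y,
          validation_data=(validation_x, validation_y))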

Related

Keras Data Augmentation with ImageDataGenerator (Your input ran out of data)

I am currently learning how to perform data augmentation with Keras ImageDataGenerator from "Deep learning with Keras" by François Chollet.
I now have 1000 (Dogs) & 1000 (Cats) images in the training dataset, and 500 (Dogs) & 500 (Cats) images in the validation dataset.
The book defines a batch size of 32 for both the training and validation data generators, and sets both steps_per_epoch and epochs when fitting the model.
However, when I train the model, I receive the TensorFlow warning "Your input ran out of data..." and the training process stops.
I searched online, and many solutions mentioned that the steps should be:
steps_per_epoch = len(train_dataset) // batch_size and validation_steps = len(validation_dataset) // batch_size
I understand the logic above, and with those settings there is no warning during training.
But I am wondering: originally I have 2000 training samples. This is too few, which is why I need data augmentation to increase the number of training images.
If steps_per_epoch = len(train_dataset) // batch_size is applied, then since len(train_dataset) is only 2000, am I not still using 2000 samples to train the model instead of adding more augmented images?
from keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True)

test_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(
    train_dir,
    target_size=(150, 150),
    batch_size=32,
    class_mode='binary')

validation_generator = test_datagen.flow_from_directory(
    validation_dir,
    target_size=(150, 150),
    batch_size=32,
    class_mode='binary')

history = model.fit_generator(
    train_generator,
    steps_per_epoch=100,
    epochs=100,
    validation_data=validation_generator,
    validation_steps=50)
In fact, ImageDataGenerator does not increase the size of the training set; all augmentation is done in memory. An original image is augmented randomly, and then its augmented version is returned. If you want to have a look at the augmented images, you need to set these parameters on flow_from_directory:

    save_to_dir=path,
    save_prefix="",
    save_format="png",

Now you have 2000 images, and with a batch size of 32 you would have 2000 // 32 = 62 steps per epoch, but you are trying to run 100 steps, which is what causes the error.
If you have a dataset that does not generate batches and you want to use all data points, then you should set:
steps_per_epoch = len(train_dataset) // batch_size
But when you use flow_from_directory, it generates batches, so there is no need to set steps_per_epoch unless you want to use fewer data points than the generated batches.
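For example, a hedged way to both inspect the augmented output and keep steps_per_epoch consistent with what the generator can actually supply (the save_to_dir folder is a hypothetical path):

train_generator = train_datagen.flow_from_directory(
    train_dir,
    target_size=(150, 150),
    batch_size=32,
    class_mode='binary',
    save_to_dir='augmented_preview',  # hypothetical folder for inspecting augmented images
    save_prefix='aug',
    save_format='png')

history = model.fit_generator(
    train_generator,
    steps_per_epoch=len(train_generator),  # the number of batches the generator yields per pass
    epochs=100,
    validation_data=validation_generator,
    validation_steps=len(validation_generator))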

What may be the cause of pretty slow training speed for CNN (transfer learning)?

I use a GPU for training a model with transfer learning from Inception v3, with weights='imagenet'. The convolutional base is frozen, and dense layers on top are used for 10-class classification on MNIST digit recognition.
The code is the following:
import tensorflow as tf
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from keras.layers import Flatten, Dense, BatchNormalization, Dropout
from keras.optimizers import RMSprop

datagen = ImageDataGenerator(
    # rescale=1./255,
    preprocessing_function=tf.keras.applications.inception_v3.preprocess_input,
    featurewise_center=False,             # set input mean to 0 over the dataset
    samplewise_center=False,              # set each sample mean to 0
    featurewise_std_normalization=False,  # divide inputs by std of the dataset
    samplewise_std_normalization=False,   # divide each input by its std
    zca_whitening=False,                  # apply ZCA whitening
    rotation_range=10,                    # randomly rotate images in the range (degrees, 0 to 180)
    zoom_range=0.1,                       # randomly zoom image
    width_shift_range=0.1,                # randomly shift images horizontally (fraction of total width)
    height_shift_range=0.1,               # randomly shift images vertically (fraction of total height)
    horizontal_flip=False,                # randomly flip images
    vertical_flip=False)

train_generator = datagen.flow_from_directory(
    train_path,
    target_size=(224, 224),
    color_mode="rgb",
    class_mode="categorical",
    batch_size=86,
    interpolation="bilinear",
)

test_generator = datagen.flow_from_directory(
    test_path,
    target_size=(224, 224),
    color_mode="rgb",
    class_mode="categorical",
    batch_size=86,
    interpolation="bilinear",
)

# Import pre-trained model InceptionV3
from keras.applications import InceptionV3

# Instantiate the convolutional base
conv_base = InceptionV3(weights='imagenet',
                        include_top=False,
                        input_shape=(224, 224, 3))  # 3 = number of channels in RGB pictures

# Freeze the convolutional part
conv_base.trainable = False

# Build model
model = Sequential()
model.add(conv_base)
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))

# Define the optimizer
optimizer = RMSprop(lr=0.001, rho=0.9, epsilon=1e-08, decay=0.0)

# Compile the model
model.compile(optimizer=optimizer, loss="categorical_crossentropy", metrics=['accuracy'])

history = model.fit_generator(train_generator,
                              epochs=1, validation_data=test_generator,
                              verbose=2, steps_per_epoch=60000 // 86)
# , callbacks=[learning_rate_reduction])
The training rate I obtained was 1 epoch/hour (even after reducing lr to 0.001) when I used rescale=1./255 in the data generator.
After searching for answers, I found that the cause may be an input in an inappropriate format.
When I tried to use preprocessing_function=tf.keras.applications.inception_v3.preprocess_input, I received this message after 30 min of training:
Epoch 1/1
/usr/local/lib/python3.6/dist-packages/keras/utils/data_utils.py:616: UserWarning: The input 1449 could not be retrieved. It could be because a worker has died.
UserWarning)
/usr/local/lib/python3.6/dist-packages/keras/utils/data_utils.py:616: UserWarning: The input 614 could not be retrieved. It could be because a worker has died.
UserWarning)
What is wrong with the model?
Thanks in advance.
The learning rate doesn't affect how long an epoch takes.
How fast you train the model depends on your GPU, your CPU, and the IO speed of your drive, ceteris paribus.
First, check whether your GPU is being used for training:

from keras import backend as K
K.tensorflow_backend._get_available_gpus()

Next, is your batch_size (86 here) the largest your GPU can handle? Try increasing it until you get an OOM error.
Or you could monitor your GPU and CPU usage: if neither is maxed out, training may be limited by your drive's IO speed.
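If you are on a version where keras.backend.tensorflow_backend is no longer available, an equivalent check (assuming TensorFlow 2.1 or later) is:

import tensorflow as tf
print(tf.config.list_physical_devices('GPU'))  # an empty list means no GPU is visible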
Nothing is wrong with the model.
To increase the speed of your epochs, try the following:
Switch on XLA:
import tensorflow as tf
tf.config.optimizer.set_jit(True)
Use mixed precision:
from tensorflow.keras.mixed_precision import experimental as mixed_precision
policy = mixed_precision.Policy('mixed_float16')
mixed_precision.set_policy(policy)
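One caveat worth noting, per TensorFlow's mixed-precision guidance: when mixed_float16 is active, the final layer should generally produce float32 outputs for numeric stability, e.g.:

# keep the softmax output in float32 when mixed_float16 is active
model.add(Dense(10, activation='softmax', dtype='float32'))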

DCNN for Binary Classification Converges to 50%/50%

I am new to Keras and have never asked a question here, so excuse any rookie mistakes I might make.
What I am trying to do is implement a binary classifier operating on images (CTs, to be exact).
My model is based on a pretrained net that performed classification on 14 classes (see this wonderful repo: https://github.com/jrzech/reproduce-chexnet).
As the saying goes, "crawl before you walk, walk before you run", so my current humble goal is to achieve overfitting of the network on some 100 examples.
My current problem is that the net converges to a strange solution: the output neuron (I'm using sigmoid) is always very close to 50%, with 100% of the predictions going to one class (so I'm stuck at about 50% accuracy). My loss and accuracy do not change at all from epoch 1 or so.
Things I tried/considered:
using different optimizers (I used the Adam optimizer and then the SGD shown below);
going with categorical crossentropy (with a softmax layer at the end instead of sigmoid, since some say it might perform better: Keras' fit_generator() for binary classification predictions always 50%);
adding an additional dense layer (I thought I might be underfitting somehow);
changing the batch size to 128 (and overfitting on 1000 examples).
All failed miserably, so I'm kind of at a loss here. I would be happy to provide more details if needed, and would appreciate any help or insights you might have. Major parts of my code are attached. Note that the ModelFactory() that I'm loading and using is the pretrained one.
Thanks in advance!
data generator code
from keras.preprocessing.image import ImageDataGenerator

rescale = 1./255.0
target_size = (224, 224)
batch_size = 128

train_datagen = ImageDataGenerator(
    width_shift_range=0.1,
    height_shift_range=0.1,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    rescale=rescale
)

train_generator = train_datagen.flow_from_dataframe(
    train_csv,
    directory=train_path,
    x_col='image_name',
    y_col='class',
    target_size=target_size,
    color_mode='rgb',
    class_mode='binary',
    batch_size=batch_size,
    shuffle=True,
)
my model
def get_model():
    file_name = '/content/brucechou1983_CheXNet_Keras_0.3.0_weights.h5'
    base_model = ModelFactory().get_model(class_names=[str(i) for i in range(14)],
                                          weights_path=file_name)
    x = base_model.output
    x = keras.layers.Dense(1024, activation='relu')(x)
    x = keras.layers.BatchNormalization(trainable=True)(x)
    predictions = keras.layers.Dense(1, activation='sigmoid')(x)
    model = keras.models.Model(inputs=base_model.inputs, outputs=predictions)

    for layer in base_model.layers:
        layer.trainable = False

    model.summary()
    return model
training the model
import numpy as np
import sklearn.utils.class_weight

class_weight = sklearn.utils.class_weight.compute_class_weight('balanced',
                                                               np.unique(train_csv['class']),
                                                               train_csv['class'])

model.compile(keras.optimizers.SGD(lr=1e-6, decay=1e-6, momentum=0.9, nesterov=True),
              loss='binary_crossentropy',
              metrics=['binary_accuracy'])

history = model.fit_generator(
    train_generator,
    steps_per_epoch=len(train_generator),
    epochs=10,
    verbose=1,
    class_weight=class_weight
)

Data-augmentation generators not working with TensorFlow 2.0

I am trying to train a model with image data-augmentation generators on TensorFlow 2.0, after downloading Kaggle's cats_vs_dogs dataset, using the code below.
from keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(rescale=1. / 255,
                                   rotation_range=40,
                                   width_shift_range=0.2,
                                   height_shift_range=0.2,
                                   shear_range=0.2,
                                   zoom_range=0.2,
                                   horizontal_flip=True)

test_datagen = ImageDataGenerator(rescale=1. / 255)

train_generator = train_datagen.flow_from_directory(train_dir,
                                                    target_size=(150, 150),
                                                    batch_size=32,
                                                    class_mode='binary')

validation_generator = test_datagen.flow_from_directory(validation_dir,
                                                        target_size=(150, 150),
                                                        batch_size=32,
                                                        class_mode='binary')

history = model.fit_generator(train_generator,
                              steps_per_epoch=100,
                              epochs=100,
                              validation_data=validation_generator,
                              validation_steps=50)
But on the first epoch, I get this error:
Found 2000 images belonging to 2 classes.
Found 1000 images belonging to 2 classes.
WARNING:tensorflow:From <ipython-input-18-e571f2719e1b>:27: Model.fit_generator (from tensorflow.python.keras.engine.training) is deprecated and will be removed in a future version.
Instructions for updating:
Please use Model.fit, which supports generators.
WARNING:tensorflow:sample_weight modes were coerced from
...
to
['...']
WARNING:tensorflow:sample_weight modes were coerced from
...
to
['...']
Train for 100 steps, validate for 50 steps
Epoch 1/100
63/100 [=================>............] - ETA: 59s - loss: 0.7000 - accuracy: 0.5000 WARNING:tensorflow:Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least `steps_per_epoch * epochs` batches (in this case, 10000 batches). You may need to use the repeat() function when building your dataset.
How should I modify the above code base for TensorFlow 2?
The Kaggle dataset contains 25000 training examples. The error message states:
tensorflow: Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least steps_per_epoch * epochs batches (in this case, 10000 batches). You may need to use the repeat() function when building your dataset.
This means the data generator needs to produce at least 10000 batches. But with the current batch size of 32, the generator produces only about 25000 / 32 ≈ 781 batches. My suggestion is to reduce steps_per_epoch or epochs and try again.
You can get rid of the deprecation message by passing the generator object to model.fit(...) instead of model.fit_generator, as sketched below.
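A hedged TF2-style rewrite of the fitting call, deriving the step counts from the generators themselves so the input cannot run out:

history = model.fit(
    train_generator,
    steps_per_epoch=len(train_generator),        # ceil(2000 / 32) = 63 batches found above
    epochs=100,
    validation_data=validation_generator,
    validation_steps=len(validation_generator))  # ceil(1000 / 32) = 32 batches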

Why is my Neural Net not learning

I have a CNN that I'm trying to train, and I can't figure out why it's not learning. It has 32 classes, which are different types of clothes, with about 1000 images in each folder.
The issue is this result at the end of training, which takes about 9 hours on my GPU:
loss: 3.3403 - acc: 0.0542 - val_loss: 3.3387 - val_acc: 0.0534
If anyone could give me directions on how to get this network to train better, I would be grateful.
from keras import backend as K
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense
from keras.preprocessing.image import ImageDataGenerator

# dimensions of our images.
img_width, img_height = 228, 228
train_data_dir = 'Clothes/train'
validation_data_dir = 'Clothes/test'
nb_train_samples = 25061
nb_validation_samples = 8360
epochs = 20
batch_size = 64

if K.image_data_format() == 'channels_first':
    input_shape = (3, img_width, img_height)
else:
    input_shape = (img_width, img_height, 3)

model = Sequential()
model.add(Conv2D(filters=64, kernel_size=2, padding='same', activation='tanh', input_shape=input_shape))
model.add(MaxPooling2D(pool_size=2))
model.add(Dropout(0.3))
model.add(Conv2D(filters=32, kernel_size=2, padding='same', activation='tanh'))
model.add(MaxPooling2D(pool_size=2))
model.add(Dropout(0.3))
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(32, activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

train_datagen = ImageDataGenerator(
    rescale=1. / 255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True)

test_datagen = ImageDataGenerator(rescale=1. / 255)

train_generator = train_datagen.flow_from_directory(
    train_data_dir,
    target_size=(img_width, img_height),
    batch_size=batch_size,
    class_mode='categorical',
    shuffle=True)

validation_generator = test_datagen.flow_from_directory(
    validation_data_dir,
    target_size=(img_width, img_height),
    batch_size=batch_size,
    class_mode='categorical',
    shuffle=True)

history = model.fit_generator(
    train_generator,
    steps_per_epoch=nb_train_samples // batch_size,
    epochs=epochs,
    validation_data=validation_generator,
    validation_steps=nb_validation_samples // batch_size)
Here is the plot of the training & validation loss
A network may not converge/learn for several reasons, but here is a list of tips that I think are relevant in your case (based on my own experience):
Transfer learning: The first thing you should know is that it's very hard to train an image classifier from scratch for most problems; you need much more computing power and time for that. I strongly recommend using transfer learning. There are multiple trained architectures available in Keras that you can use as initial weights for your network (or other methods).
Training step: For the optimizer, I recommend using Adam first and varying the learning rate to see how the loss responds. Also, since you are using convolutional layers, you should consider adding batch normalization layers, which can significantly speed up training, and changing the convolutional activations to 'relu', which makes them much faster to train.
You could also try decreasing the dropout values, but I don't think that's the main issue here. Also, if you are considering training your network from scratch, you should start with fewer layers and add more gradually to get a better idea of what's going on.
Train/test split: I see that you are using 8360 observations in your test set. Given the size of your training set, I think that's too many; 1000, for example, is enough. The more training samples you have, the better your results will be.
Also, before judging the accuracy of your model, you should start by establishing a baseline model to benchmark against. The baseline model depends on your problem, but in general I choose one that predicts the most common class in the dataset. You should also look at another metric, top-k accuracy ('top_k_categorical_accuracy' in Keras), which is interesting when you have a relatively high number of classes to predict: it helps you see how close your model is to the right prediction.
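For reference, a minimal sketch of compiling with that metric (the built-in 'top_k_categorical_accuracy' string defaults to k=5; the loss and optimizer are just the ones already discussed above):

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy', 'top_k_categorical_accuracy'])  # top-5 by default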
First, in order to keep your sanity, check carefully for any bugs, and verify that your data is being fed in as intended.
You might want to add a top-k accuracy metric to get a better idea of whether the model is close to the right answer or totally wrong.
Here are some tuning things to try:
Change the kernel size to 3 and the activation to relu:
model.add(Conv2D(filters=64, kernel_size=3, padding='same', activation='relu'))
If you think your model is underfitting, then try increasing the number of conv layers per pooling block to start with, as in the sketch below. You could also increase the number of filters or the number of conv + pool repetitions.
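A minimal, hedged example of stacking two convolutions before each pooling step (the filter counts are illustrative, not tuned; input_shape is the variable from the question's code):

model.add(Conv2D(filters=64, kernel_size=3, padding='same', activation='relu', input_shape=input_shape))
model.add(Conv2D(filters=64, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=2))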
The Adam optimizer might learn a bit faster than RMSprop:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
Probably the biggest improvement would come from getting more data; I think your dataset is probably too small for the scope of the problem.
You might also want to try transfer learning from a pre-trained image recognition network, sketched below.
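A minimal, hedged transfer-learning sketch for this 32-class setup (it reuses img_width and img_height from the question; VGG16 and the 256-unit head are assumptions, not tuned choices):

from keras.applications import VGG16
from keras.models import Model
from keras.layers import Flatten, Dense, Dropout

# Pretrained convolutional base, frozen so only the new head is trained
base = VGG16(weights='imagenet', include_top=False,
             input_shape=(img_width, img_height, 3))
base.trainable = False

x = Flatten()(base.output)
x = Dense(256, activation='relu')(x)  # assumed head size
x = Dropout(0.5)(x)
outputs = Dense(32, activation='softmax')(x)

model = Model(inputs=base.input, outputs=outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])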