Error when running LSTM model, Loss: NaN values - tensorflow

My LSTM model using Keras and TensorFlow is giving loss: nan values.
I have tried reducing the learning rate, but I still get nan and decreasing overall accuracy. I have also used np.any(np.isnan(x_train)) to check for NaN values that I may be introducing myself (no NaNs were found). I have read about exploding gradients but can't seem to find anything that helps with my specific issue.
I think I have an idea of where the issue may be, but I'm not quite sure. This is the process I implemented to build x_train.
For example:
a = [[1,0,..0], [0,1,..0], [0,0,..1]]
a.shape  # (3, 20)
b = [[0,0,..1], [0,1,..0], [1,0,..0], [0,1,..0]]
b.shape  # (4, 20)
To ensure that the shapes are the same, I append an all-zero vector [0,0,..0] to a, so its shape becomes (4, 20).
a and b are then stacked into a 3D array of shape (2, 4, 20), and this forms x_train. But I think appending the empty vectors of zeros is for some reason giving me loss: nan while training my model. Is this where I could be going wrong?
N.B. the stacked result is a numpy array, and my actual x_train.shape is (1228, 1452, 20).
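For reference, a minimal numpy sketch of the padding step described above (the one-hot rows are just the ones from the small example):
import numpy as np

a = np.eye(20, dtype=np.float32)[[0, 1, 2]]        # (3, 20) one-hot rows
b = np.eye(20, dtype=np.float32)[[19, 1, 0, 1]]    # (4, 20) one-hot rows

# pad a with an all-zero row so both sequences have 4 timesteps
a_padded = np.vstack([a, np.zeros((1, 20), dtype=np.float32)])  # (4, 20)
x_train = np.stack([a_padded, b])                  # (2, 4, 20)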
Edit: model.summary() added below:
x_train shape: (1228, 1452, 20)
y_train shape: (1228, 1452, 8)
x_val shape: (223, 1452, 20)
y_val shape: (223, 1452, 8)
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
unified_lstm (UnifiedLSTM) (None, 1452, 128) 76288
_________________________________________________________________
batch_normalization_v2 (Batc (None, 1452, 128) 512
_________________________________________________________________
unified_lstm_1 (UnifiedLSTM) (None, 1452, 128) 131584
_________________________________________________________________
batch_normalization_v2_1 (Ba (None, 1452, 128) 512
_________________________________________________________________
dense (Dense) (None, 1452, 32) 4128
_________________________________________________________________
dense_1 (Dense) (None, 1452, 8) 264
=================================================================
Total params: 213,288
Trainable params: 212,776
Non-trainable params: 512

The solution is to use the Masking() layer available in Keras with mask_value=0. Without it, the all-zero padding vectors are included in the loss calculation; with Masking(), as outlined in the Keras docs, the padded timesteps are skipped and not included.
As per keras documentation:
'If all features for a given sample timestep are equal to mask_value, then the sample timestep will be masked (skipped) in all downstream layers (as long as they support masking)'
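A minimal sketch of where the layer goes, using the shapes from the question and leaving out the BatchNormalization layers for brevity (the layer sizes are assumed to match the summary above):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Masking, LSTM, Dense

model = Sequential([
    # Timesteps whose 20 features are all equal to mask_value are skipped by
    # the downstream layers, so the all-zero padding rows stop contributing
    # to the loss.
    Masking(mask_value=0.0, input_shape=(1452, 20)),
    LSTM(128, return_sequences=True),
    LSTM(128, return_sequences=True),
    Dense(32, activation='relu'),
    Dense(8, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])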

I would advise you to check the following (see the sketch after this list):
The output of your Batch Normalization layer. I once encountered a similar problem where the loss was coming out as nan; when I checked the normalization output, it was all zeros. Maybe that is what made the loss nan.
A possible reason for NaNs is too high a learning rate. Try reducing it a bit and check the output.
If you are using RMSProp, try Adam instead.
Since your dense_1 layer has shape (None, 8), I am assuming you are working on some sort of classification problem. Because log loss is used here, precision errors can also come into play. If you are using float16, change the precision to float32.
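A sketch of how those suggestions might look in code (the learning rate and clipnorm values are illustrative assumptions, not tuned values):
from tensorflow.keras.optimizers import Adam

# Assumption: 'model' is the network from the question's summary.
# Adam instead of RMSProp, a smaller learning rate, and gradient clipping
# in case exploding gradients are part of the problem.
model.compile(optimizer=Adam(learning_rate=1e-4, clipnorm=1.0),
              loss='categorical_crossentropy',
              metrics=['accuracy'])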

Instead of padding with an all-zeros vector, you should use a dummy feature. That is, your one-hot feature vector grows to size (21,), e.g. [0, 0, 0, ..., 1] of size 21 with the last dimension reserved for padding.
I also advise you to use index-based input instead of explicit one-hot vectors, where each one-hot vector is replaced by the index of its 1, e.g. [0, 0, 1, ..., 0] becomes 2. Keras supports this index-based input style with its Embedding layer (see the sketch below). This will be easier to use and more computationally efficient.
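A minimal sketch of the index-based approach, assuming index 0 is reserved for padding so that mask_zero=True skips the padded timesteps (the embedding size of 16 is an arbitrary choice):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

vocab_size = 21  # 20 real one-hot positions + 1 padding index
model = Sequential([
    # Each timestep is now a single integer; 0 marks padding and is masked out.
    Embedding(input_dim=vocab_size, output_dim=16, mask_zero=True, input_length=1452),
    LSTM(128, return_sequences=True),
    Dense(8, activation='softmax'),
])
# Under this convention, a one-hot row such as [0, 0, 1, ..., 0] becomes the
# integer 3 (its index shifted by 1 so that 0 stays free for padding).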

Related

How to split dataset to implement svm classifier after extracting features from Inception v3 transfer learning?

The training dataset consists of 42848 images in 4 class subdirectories.
image_size = [520, 578]
BATCH_SIZE = 32
Model:
inception = InceptionV3(input_shape=CROP_SHAPE + [3], weights='imagenet', include_top=False)
for layer in inception.layers:
    layer.trainable = False
x = inception.output
x = GlobalAveragePooling2D()(x)
x = Dense(1024, activation='relu')(x)
prediction = Dense(len(folders), activation='softmax')(x)
Here's the model summary.
Model: "model"
_________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_1 (InputLayer)                              [(None, 520, 578, 3)]  0        []
conv2d (Conv2D)                                   (None, 259, 288, 32)   864      ['input_1[0][0]']
batch_normalization (BatchNormalization)          (None, 259, 288, 32)   96       ['conv2d[0][0]']
mixed9_1 (Concatenate)                            (None, 14, 16, 768)    0        ['activation_87[0][0]', 'activation_88[0][0]']
concatenate_1 (Concatenate)                       (None, 14, 16, 768)    0        ['activation_91[0][0]', 'activation_92[0][0]']
activation_93 (Activation)                        (None, 14, 16, 192)    0        ['batch_normalization_93[0][0]']
mixed10 (Concatenate)                             (None, 14, 16, 2048)   0        ['activation_85[0][0]', 'mixed9_1[0][0]', 'concatenate_1[0][0]', 'activation_93[0][0]']
global_average_pooling2d (GlobalAveragePooling2D) (None, 2048)           0        ['mixed10[0][0]']
dense (Dense)                                     (None, 1024)           2098176  ['global_average_pooling2d[0][0]']
dropout (Dropout)                                 (None, 1024)           0        ['dense[0][0]']
Feature_extractor (Dense)                         (None, 64)             65600    ['dropout[0][0]']
dense_1 (Dense)                                   (None, 4)              260      ['Feature_extractor[0][0]']
==================================================================================================
Total params: 23,966,820
Trainable params: 2,164,036
Non-trainable params: 21,802,784
__________________________________
I've extracted features from the Inception v3 model. Now I want to split the features using scikit-learn to train an SVM classifier.
from sklearn.model_selection import train_test_split
model_feat = Model(inputs=loaded_model.input, outputs=loaded_model.get_layer('Feature_extractor').output)
feat_trainX = model_feat.predict(train_data)
...
X_train2, X_test2, y_train2, y_test2 = train_test_split(feat_trainX, train_gen, test_size=0.25, random_state=42)
Here, the "train_gen" directory iterator is used as the labels, but I am getting inconsistent values. I've also had the same problem using "tf.keras.utils.image_dataset_from_directory".
from tensorflow.keras.preprocessing.image import ImageDataGenerator
train_datagen = ImageDataGenerator(rescale=1./255)
train_gen = train_datagen.flow_from_directory(
    '/content/DeepSeagrass/Training',
    target_size=CROP_SHAPE,
    batch_size=BATCH_SIZE,
    shuffle=True,
    class_mode='categorical')
Found 42848 images belonging to 4 classes.
Is there any solution for labeling a large image dataset for training discriminative algorithms?
It seems that flow_from_directory and tf.keras.utils.image_dataset_from_directory both take the data in small random batches, which may cause inconsistency between features and labels.
Now, the main question is how to tackle this situation while using less GPU memory. I'm using Google Colab, which runs out of memory on the GPU runtime if I try to convert the whole dataset into a numpy array.
Also, if that is possible, how can I save the features with labels in a CSV file for further visualization such as a t-SNE plot?
Below are references I've found, but none gave an exact solution:
How to store CNN extracted features to train a SVM classifier
How to implement t-SNE in tensorflow?
https://pyimagesearch.com/2019/05/27/keras-feature-extraction-on-large-datasets-with-deep-learning/
Actually, the issue is that you're using flow_from_directory() with a batch_size smaller than the entire input, which is why it only produces 1339 elements at a time (one per batch). The number of items in the dataset created by flow_from_directory() is total_number / batch_size. Either set the batch size in flow_from_directory() so that everything fits in a single batch, or load all the images into memory before train_test_split(). Either one of those will load everything into memory.
It's also worth noting that train_gen is not an array; it's a dataset-like iterator, which means it contains both x and y values. You most likely aren't trying to get both features and labels there, right? If you need one or the other, you'll need to debug and look into that variable; whenever I do, it contains arrays x and y, corresponding to features and labels respectively, though not everyone sees them named that way. You could technically do it this way, but as in my comment, given the random nature of flow_from_directory(), it probably won't be as robust as loading everything into memory (because some images could get skipped and some could get picked multiple times).
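A sketch of one way to keep features and labels aligned without holding the whole dataset in memory: re-create the generator with shuffle=False, predict batch by batch, and stack the results (this assumes model_feat and the generator settings from the question):
import numpy as np

train_gen = train_datagen.flow_from_directory(
    '/content/DeepSeagrass/Training',
    target_size=CROP_SHAPE,
    batch_size=BATCH_SIZE,
    shuffle=False,              # fixed order, so features stay aligned with labels
    class_mode='categorical')

features, labels = [], []
for i in range(len(train_gen)):          # len(train_gen) == number of batches
    x_batch, y_batch = train_gen[i]
    features.append(model_feat.predict(x_batch, verbose=0))
    labels.append(y_batch)

feat_trainX = np.concatenate(features)   # (num_images, 64)
y_labels = np.concatenate(labels).argmax(axis=1)
These two arrays can then go into train_test_split() or be written to a CSV for a t-SNE plot.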

How can Keras calculate the number of parameters at an early stage when there are still None dimensions?

Sorry for the very basic question (I'm new to Keras). I was wondering how Keras can calculate the number of parameters for each layer at an early stage (before fit), even though model.summary shows that some dimensions still have None values at this stage. Are these values already determined in some way, and if so, why not show them in the summary?
I'm asking because I'm having a hard time figuring out my "tensor shape bug" (I'm trying to determine the output dimensions of the C5 block of my ResNet50 model, but I cannot see them in model.summary even though I see the number of parameters).
I give below an example based on the C5_reduced layer in RetinaNet, which is fed by the C5 layer of ResNet50. C5_reduced is
Conv2D(256, kernel_size=1, strides=1, padding='same')
Based on model.summary for this particular layer:
C5_reduced (Conv2D) (None, None, None, 256) 524544
I've guessed that C5 is (None, 1, 1, 2048) because 2048*256 + 256 = 524544 (I don't know how to confirm or refute that hypothesis). So if it's already known, why not show it in the summary? If dimensions 2 and 3 had been different, the number of parameters would have been different too, right?
If you pass the exact input shape to the very first layer or input layer of your network, you will see the output shapes you want. For instance, I used an input layer here:
input_1 (InputLayer) [(None, 224, 224, 3)] 0
_________________________________________________________________
block1_conv1 (Conv2D) (None, 224, 224, 64) 1792
_________________________________________________________________
block1_conv2 (Conv2D) (None, 224, 224, 64) 36928
The input was passed as (224, 224, 3); 3 represents the depth (channels) here. Note that the parameter calculation for convolutional layers differs from that of Dense layers.
If you do the following:
tf.keras.layers.Conv2D(16, (3,3), activation='relu', input_shape=(150, 150, 3))
You will see:
conv2d (Conv2D) ---> (None, 148, 148, 16)
The dimensions are reduced to 148x148 because in Keras padding is 'valid' by default and strides is 1, so the output size is (150 - 3)/1 + 1 = 148 in each spatial dimension.
So then, what are the None values?
The first None value is the batch size; in Keras the first dimension is always the batch size. You can fix it in advance, or leave it to be determined when fitting or predicting.
In 2D convolution the expected input is (batch_size, height, width, channels), and you can also have shapes such as (None, None, None, 3), which means varying image sizes are allowed.
Edit:
tf.keras.layers.Input(shape = (None, None, 3)),
tf.keras.layers.Conv2D(16, (3,3), activation='relu')
Produces:
conv2d_21 (Conv2D) (None, None, None, 16) 448
Regarding your question: how are the parameters calculated even though we passed the image height and width as None?
Convolution parameters are calculated according to:
(filter_height * filter_width * input_image_channels + 1) * number_of_filters
When we put them into formula,
filter_height = 3
filter_width = 3
input_image_channel = 3
number_of_filters = 16
Parameters = (3 x 3 x 3 + 1) * 16 = 28 * 16 = 448
Notice that we only needed the input image's channel count, which is 3, indicating an RGB image.
If you want to calculate the params for later convolutions, remember that the number of filters of the previous layer becomes the number of input channels of the current layer.
That's why dimensions other than the channel count can stay None (like the batch size): Keras only needs to know whether your image is RGB or not in this case. Alternatively, you can avoid specifying the spatial dimensions when creating the model and pass them when fitting the model with the dataset.
You need to define an input layer for your model. The total number of trainable parameters is unknown until you either (a) compile the model and feed it data, at which point the model builds a graph based on the dimensions of the input and you can then determine the number of params, or (b) define an input layer for the model with the input dimensions stated, after which you can find the number of params with model.summary().
The point is that the model cannot know the number of parameters between the input and the first hidden layer until that input is defined, or until you run inference and give it the shape of the input.
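As a practical illustration of point (b), a sketch for inspecting the C5 output shape mentioned in the question. It assumes the tf.keras ResNet50, where the final C5 activation layer is named 'conv5_block3_out'; check your own model's layer names with model.summary():
from tensorflow.keras.applications import ResNet50

# With a concrete input size, the C5 spatial dimensions become visible.
backbone = ResNet50(include_top=False, input_shape=(224, 224, 3), weights=None)
print(backbone.get_layer('conv5_block3_out').output_shape)  # (None, 7, 7, 2048)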

How to calculate the number of multiplications happening in BatchNormalization layer during test evaluation?

or, why do my CNN's test evaluations take significantly longer with BatchNormalization than without?
I need to approximate the theoretical runtime for the evaluation of a trained CNN (using Keras with the TF backend) on a test set. Thus, I attempted to calculate the number of multiplications happening during evaluation to use as a metric.
But for some reason, Batch Normalization (BN) appears to have a significant impact on the evaluation time, even though, as I understand it, it should be negligible in theory.
I can calculate the number of multiplications for Dense and Conv layers, and I thought I could ignore the computations for the activation function and the Batch Normalization, as both add only one multiplication per input, which is significantly less than what the convolutional layers do.
However, when I tested the same network once with and once without Batch Normalization after every conv layer, I noticed that I cannot ignore it:
In the simple example given below, there is only one conv layer with filter size (3x3), followed by a softmax-activated dense layer, as I'm doing classification.
With BN after the conv layer, it takes me ~4.6 seconds to work through the test set.
Using the otherwise exact same net architecture without BN, the same test set is processed in half the time.
Summary of the test configuration with BN (finishes test set evaluation in ~4.6s):
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d_1 (Conv2D) (None, 32, 32, 32) 896
_________________________________________________________________
batch_normalization_1 (Batch (None, 32, 32, 32) 128
_________________________________________________________________
flatten_1 (Flatten) (None, 32768) 0
_________________________________________________________________
dense_1 (Dense) (None, 43) 1409067
=================================================================
Total params: 1,410,091
Trainable params: 1,410,027
Non-trainable params: 64
Without BN (finishes test set evaluation in ~2.3s):
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d_2 (Conv2D) (None, 32, 32, 32) 896
_________________________________________________________________
flatten_2 (Flatten) (None, 32768) 0
_________________________________________________________________
dense_2 (Dense) (None, 43) 1409067
=================================================================
Total params: 1,409,963
Trainable params: 1,409,963
Non-trainable params: 0
I don't know how this scales, as I don't understand the cause in the first place, but I can say that I tested other nets with 3 to 6 identical conv layers (using padding='same' to keep the dimensions constant), and the difference in test evaluation time varied between ~25% and ~50% in most cases (the one-conv-layer example given below is even ~100%).
Why does BN have such a big impact? In other words, what calculations are happening that I'm missing?
I thought BN just adds one multiplication per input. So, for example, in the network with BN given above:
I expected batch_normalization_1 to add 32*32*32 multiplications, and conv2d_1 to add 32*32*32*3*3 multiplications.
But then, how does that have so much impact on the overall runtime, even though the conv layers add more multiplications?
Code used to build the model:
from keras.models import Sequential
from keras.layers import Conv2D, BatchNormalization, Flatten, Dense

model = Sequential()
model.add(Conv2D(32, (3, 3), activation="relu", input_shape=x_train.shape[1:], padding="same"))
model.add(BatchNormalization())
model.add(Flatten())
model.add(Dense(43, activation='softmax'))
with x_train.shape[1:] being (32, 32, 3), representing a 32x32 image with RGB colors.
Kind of answering my own question here, in case anyone stumbles across the same issue.
By embedding the Fritz AI benchmark library https://docs.fritz.ai/python-library/benchmark.html, I could actually check the number of FLOPs per layer, and it indeed turned out that the normalization adds only a negligible amount of computation.
----------------------------------------------------------------------------------------------------------------------
Layer (type) Output Shape MFLOPS Weights Core ML Compatible
======================================================================================================================
conv2d_1 (Conv2D) [None, 32, 32, 32] 0.92 896 True
----------------------------------------------------------------------------------------------------------------------
batch_normalization_1 (BatchNormalization) [None, 32, 32, 32] 0.07 128 True
----------------------------------------------------------------------------------------------------------------------
flatten_1 (Flatten) [None, 32768] 0.00 0 True
----------------------------------------------------------------------------------------------------------------------
dense_1 (Dense) [None, 43] 2.82 1,409,067 True
----------------------------------------------------------------------------------------------------------------------
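Those per-layer figures are roughly what the standard multiply counts predict (a back-of-the-envelope sketch; FLOP-counting conventions differ between tools, so treat these as order-of-magnitude numbers):
# Rough multiply counts for the 32x32x3 input network above.
H, W = 32, 32            # spatial size of the conv output (padding='same')
filters, k, in_ch = 32, 3, 3

conv_mults = H * W * filters * k * k * in_ch   # ~0.9M multiplies
bn_mults = H * W * filters                     # one scale per element, ~0.03M
dense_mults = (H * W * filters) * 43           # ~1.4M multiplies

print(conv_mults, bn_mults, dense_mults)       # 884736 32768 1409024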
That said, the issue must be caused by some inefficient routine, or even a bug in Keras, when evaluating models with Batch Normalization. Weird, but that is the only possible explanation.

Can I use transfer learning to retrain a Neural Network on different subsets of the data to solve memory problems?

I am trying to train a neural network on the Amazon Reviews dataset so that I can teach it to classify correctly between positive and negative sentiment. The approach I am trying is to first use Google's Word2Vec model to vectorize each review, by looking up each word's vector in the model, and then feed the results into a Convolutional Neural Network to train it.
I obtained Google's pre-trained Word2Vec model from here, which gives me a 300-dimensional vector per word, and by truncating each review to 80 words I obtain an 80 x 300 matrix for each review.
The Convolutional Neural Network I train has the following structure:
Layer (type) Output Shape
- conv2d_1 (Conv2D) (None, 1, 300, 128)
- conv2d_2 (Conv2D) (None, 1, 300, 64)
- conv2d_3 (Conv2D) (None, 1, 300, 32)
- conv2d_4 (Conv2D) (None, 1, 300, 16)
- flatten_1 (Flatten) (None, 4800)
- dropout_1 (Dropout 0.5) (None, 4800)
- dense_1 (Dense) (None, 256)
- batch_normalization_1 (Batch (None, 256)
- activation_1 (Relu) (None, 256)
- dropout_2 (Dropout 0.5) (None, 256)
- dense_2 (Dense) (None, 1)
I use a large network with heavy dropout and many neurons to reduce overfitting on the training data.
However, my main problem is that I am unable to train on most of the data because I can't load all of it into memory, and since the featurized vectors consist mostly of high-precision decimals, they take up a lot of memory, and a lot of disk space if I serialize them.
Is it possible for me to use Transfer Learning to solve the problem of not training on enough data? The approach I plan on using is:
Load a subset of the dataset that can fit into the memory
Vectorize it using Google's Word2Vec model (This part takes around 5-10 minutes)
Train the model for 50-100 epochs
Load in a second subset of the dataset and repeat.
Is this a valid approach for training a large model? Because I am re-training the model on the same dataset, am I correct in assuming that I won't have to freeze any layers?
Also, is Stochastic Gradient Descent a good optimizer for this problem, since I will be training on a large amount of data?
Instead of your proposed method, I think it would be more appropriate if you used a data generator.
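A rough sketch of that idea with a keras.utils.Sequence, which vectorizes one batch of reviews at a time so only a handful of 80x300 matrices are ever in memory (the names reviews, labels, and w2v are hypothetical placeholders for the question's data and the loaded Word2Vec model):
import numpy as np
from tensorflow.keras.utils import Sequence

class ReviewSequence(Sequence):
    # Yields (batch of 80x300 float32 matrices, batch of labels) on demand.
    def __init__(self, reviews, labels, w2v, batch_size=64, max_len=80, dim=300):
        self.reviews, self.labels, self.w2v = reviews, labels, w2v
        self.batch_size, self.max_len, self.dim = batch_size, max_len, dim

    def __len__(self):
        return int(np.ceil(len(self.reviews) / self.batch_size))

    def __getitem__(self, idx):
        batch = self.reviews[idx * self.batch_size:(idx + 1) * self.batch_size]
        x = np.zeros((len(batch), self.max_len, self.dim), dtype=np.float32)
        for i, tokens in enumerate(batch):
            for j, tok in enumerate(tokens[:self.max_len]):
                if tok in self.w2v:          # look up the word's 300-d vector
                    x[i, j] = self.w2v[tok]
        y = np.asarray(self.labels[idx * self.batch_size:(idx + 1) * self.batch_size])
        return x, y

# model.fit(ReviewSequence(train_reviews, train_labels, w2v), epochs=10)
Using float32 rather than float64 for the vectors also roughly halves the memory and disk space they take up.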

Training a fully convolutional neural network with inputs of variable size takes an unreasonably long time in Keras/TensorFlow

I am trying to implement an FCNN for image classification that can accept inputs of variable size. The model is built in Keras with the TensorFlow backend.
Consider the following toy example:
model = Sequential()
# width and height are None because we want to process images of variable size
# nb_channels is either 1 (grayscale) or 3 (rgb)
model.add(Convolution2D(32, 3, 3, input_shape=(nb_channels, None, None), border_mode='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Convolution2D(32, 3, 3, border_mode='same'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Convolution2D(16, 1, 1))
model.add(Activation('relu'))
model.add(Convolution2D(8, 1, 1))
model.add(Activation('relu'))
# reduce the number of dimensions to the number of classes
model.add(Convolution2D(nb_classes, 1, 1))
model.add(Activation('relu'))
# do global pooling to yield one value per class
model.add(GlobalAveragePooling2D())
model.add(Activation('softmax'))
This model runs fine, but I am running into a performance issue. Training on images of variable size takes an unreasonably long time compared to training on inputs of fixed size. If I resize all images to the maximum size in the dataset, it still takes far less time to train the model than training on variable-size input. So is input_shape=(nb_channels, None, None) the right way to specify variable-size input? And is there any way to mitigate this performance problem?
Update
model.summary() for a model with 3 classes and grayscale images:
Layer (type) Output Shape Param # Connected to
====================================================================================================
convolution2d_1 (Convolution2D)   (None, 32, None, None)  320    convolution2d_input_1[0][0]
____________________________________________________________________________________________________
activation_1 (Activation)         (None, 32, None, None)  0      convolution2d_1[0][0]
____________________________________________________________________________________________________
maxpooling2d_1 (MaxPooling2D)     (None, 32, None, None)  0      activation_1[0][0]
____________________________________________________________________________________________________
convolution2d_2 (Convolution2D)   (None, 32, None, None)  9248   maxpooling2d_1[0][0]
____________________________________________________________________________________________________
maxpooling2d_2 (MaxPooling2D)     (None, 32, None, None)  0      convolution2d_2[0][0]
____________________________________________________________________________________________________
convolution2d_3 (Convolution2D)   (None, 16, None, None)  528    maxpooling2d_2[0][0]
____________________________________________________________________________________________________
activation_2 (Activation)         (None, 16, None, None)  0      convolution2d_3[0][0]
____________________________________________________________________________________________________
convolution2d_4 (Convolution2D)   (None, 8, None, None)   136    activation_2[0][0]
____________________________________________________________________________________________________
activation_3 (Activation)         (None, 8, None, None)   0      convolution2d_4[0][0]
____________________________________________________________________________________________________
convolution2d_5 (Convolution2D)   (None, 3, None, None)   27     activation_3[0][0]
____________________________________________________________________________________________________
activation_4 (Activation)         (None, 3, None, None)   0      convolution2d_5[0][0]
____________________________________________________________________________________________________
globalaveragepooling2d_1 (GlobalAveragePooling2D)  (None, 3)  0  activation_4[0][0]
____________________________________________________________________________________________________
activation_5 (Activation)         (None, 3)               0      globalaveragepooling2d_1[0][0]
====================================================================================================
Total params: 10,259
Trainable params: 10,259
Non-trainable params: 0
I think #marcin-możejko may have the right answer in his comment.
It may be related to this bug, which was just fixed. And this patch may warn you if things are being compiled too often.
So upgrading to a tf-nightly-gpu-2.0-preview package may fix this.
Also, do you get this problem with tf.keras?
"If I resize all images to the maximum size in the data set it still takes far less time to train the model than training on the variable size input"
Note that for basic convolutions with "same" padding, zero padding should have "no" effect on the output, aside from pixel alignment.
So one approach would be to train on a fixed list of sizes and zero-pad images to those sizes, for example training on batches of 128x128, 256x256, and 512x512. If you can't fix the dynamic compilation issue, this at least would only compile the model 3 times. It would be a bit like a 2D "bucket-by-sequence-length" approach sometimes seen with sequence models.
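A small sketch of that bucketing idea (the bucket sizes are the ones suggested above; the channels-first layout matches the model in the question):
import numpy as np

BUCKETS = [128, 256, 512]

def pad_to_bucket(img):
    # img: (channels, height, width); zero-pad up to the smallest bucket it fits in
    size = next(s for s in BUCKETS if img.shape[1] <= s and img.shape[2] <= s)
    padded = np.zeros((img.shape[0], size, size), dtype=img.dtype)
    padded[:, :img.shape[1], :img.shape[2]] = img
    return padded
Batches can then be grouped by bucket so the network only ever sees 3 distinct input shapes.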
Images of different sizes imply images of similar things at different scales. If this difference in scale is significant, the relative position of the similar things will shift from the centre of the frame towards the top left as the image size reduces. The (simple) network architecture shown is spatially aware, so it would be consistent for the rate of model convergence to degrade, as data at very different scales would be inconsistent. This architecture is not well suited to finding the same thing in different or multiple places.
A certain degree of shearing, rotation, and mirroring would help the model generalise, but only with images re-scaled to a consistent size. So, when you re-size, you fix the scaling issue and make the input data spatially consistent.
In short, I think this network architecture is simply not suited to, or capable of, the task you are giving it, i.e. handling various scales.