I am working on an image-segmentation application where the loss function is Dice loss. The issue is that the loss becomes NaN after some epochs. I am doing 5-fold cross-validation and checking the validation and training losses for each fold. For some folds the loss quickly becomes NaN, and for others it takes a while to reach NaN. I have added a smoothing constant to the loss formulation to avoid over/under-flow, but the same problem still occurs. My inputs are scaled to the range [-1, 1]. I have seen people suggest using regularizers and different optimizers, but I don't understand why the loss becomes NaN in the first place. I have pasted the loss function, and the training and validation losses for some epochs, below. Initially only the validation loss and validation dice score become NaN, but later all metrics become NaN.
def dice_loss(y_true, y_pred):  # y_true --> ground truth, y_pred --> predictions
    smooth = 1.
    y_true_f = tf.keras.backend.flatten(y_true)
    y_pred_f = tf.keras.backend.flatten(y_pred)
    intersection = tf.keras.backend.sum(y_true_f * y_pred_f)
    return 1 - (2. * intersection + smooth) / (tf.keras.backend.sum(y_true_f) +
                                               tf.keras.backend.sum(y_pred_f) + smooth)
epoch train_dice_score train_loss val_dice_score val_loss
0 0.42387727 0.423877264 0.35388064 0.353880603
1 0.23064087 0.230640889 0.21502239 0.215022382
2 0.17881058 0.178810576 0.1767999 0.176799848
3 0.15746565 0.157465705 0.16138957 0.161389555
4 0.13828343 0.138283484 0.12770002 0.127699989
5 0.10434002 0.104340041 0.0981831 0.098183098
6 0.08013707 0.080137035 0.08188484 0.081884826
7 0.07081806 0.070818066 0.070421465 0.070421467
8 0.058371827 0.058371854 0.060712796 0.060712777
9 0.06381426 0.063814262 nan nan
10 0.105625264 0.105625251 nan nan
11 0.10790708 0.107907102 nan nan
12 0.10719114 0.10719115 nan nan
I was getting the same problem with my segmentation model, which used a combination of dice loss and weighted cross-entropy loss. I found a solution, in case somebody still has the same problem.
I was focusing on my custom loss, but then I figured out that the NaN values came from inside the model during the forward computation: because of ReLU, intermediate activations grew too large and eventually became NaN.
To solve this I added batch normalization after every convolution that uses ReLU, and it worked for me.
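As a minimal sketch of that pattern in tf.keras (not the original poster's model; the layer sizes, input shape, and helper name conv_bn_relu are only illustrative, and batch norm is placed between the convolution and the ReLU here, which is one common ordering):
import tensorflow as tf

def conv_bn_relu(x, filters, kernel_size=3):
    # Convolution without activation, then batch norm, then ReLU.
    # Normalizing before the ReLU keeps activations in a bounded range,
    # which helps avoid the blow-up that produced the NaNs.
    x = tf.keras.layers.Conv2D(filters, kernel_size, padding="same")(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.Activation("relu")(x)
    return x

inputs = tf.keras.Input(shape=(128, 128, 1))  # assumed input shape
x = conv_bn_relu(inputs, 32)
x = conv_bn_relu(x, 64)
outputs = tf.keras.layers.Conv2D(1, 1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss=dice_loss)  # dice_loss as defined in the question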
CONTEXT
I have a dataframe of monthly historical prices of market indices like so (all data comes from Bloomberg):
MSCI World S&P 500 ... HFRX Event Driven Gold Spot
1969-12-31 100 92.06 ... NaN NaN
1970-01-30 94.25 85.02 ... NaN NaN
... ... ... ... ... ...
2021-07-31 3141.35 4395.26 ... 20459.292 143.77
2021-08-31 3006.6 4522.68 ... 20614.276 134.06
I want to predict the value of each index for the next month with an LSTM NN (each index has its own specially trained NN).
So a new LSTM model is initialized and trained on each of these time series (which have between 300 and 1200 samples each). This (PyTorch) LSTM model is the following:
import torch
import torch.nn as nn

class LSTMRegressor(nn.Module):
    def __init__(self, input_size, hidden_size, sequence_size, num_layers, dropout):
        super(LSTMRegressor, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.sequence_size = sequence_size
        self.num_layers = num_layers
        self.dropout = dropout
        self.lstm = nn.LSTM(
            input_size=self.input_size,
            hidden_size=self.hidden_size,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout)
        self.linear = nn.Linear(in_features=hidden_size, out_features=1)

    def forward(self, x):
        lstm_out, self.hidden = self.lstm(x)
        # Use the output of the last time step for the regression head.
        y_pred = self.linear(lstm_out[:, -1, :])
        return y_pred
Loss function and optimizer:
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(),lr=learning_rate)
My parameters are the following:
input_size = 1
hidden_size=150
num_layers=2
dropout=0
batch_size = 16
learning_rate = 0.001
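For context, here is a minimal sketch of how the model above might be instantiated and stepped once with those parameters (the 12-month look-back window and the dummy batch are assumptions, not taken from the original post):
import torch
import torch.nn as nn

sequence_size = 12  # assumed look-back window of 12 months
model = LSTMRegressor(input_size=1, hidden_size=150,
                      sequence_size=sequence_size, num_layers=2, dropout=0)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# One training step on a dummy batch of shape (batch_size, sequence_size, input_size).
x = torch.randn(16, sequence_size, 1)
y = torch.randn(16, 1)
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()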
RESULTS
For most of the indexes, the training seems to work well, as there is only a mean error of about 0.5% on the test set (see an example in the first graph below). However, for some of the indexes, the training does not work at all (about 100% error) (see an example in the second graph below).
The graphs show training/validation loss and MAPE (mean absolute percentage error). The vertical red line is simply the best epoch as determined by an "early stopping" algorithm.
Model that trained successfully (test == validation):
Model that trained unsuccessfully (test == validation):
QUESTIONS
Why do none of the LSTM models seem to overfit (I've tested with tens of thousands of epochs)?
Why do some LSTM models not train properly? (They are not the ones with the least data.)
Why do the models that do not train properly have such smooth curves?
Thank you very much for your help!
I have the following dataframe snippet (the columns were originally multi-indexed, but after saving the df to CSV and reading it back in I lost the indexing, and the second level now appears as a row):
model model_a model_a model_b model_b
NaN b pvalue b pvalue
predictor NaN NaN NaN NaN
aches 0.6991801946 0.33372434223 0.3523114106 0.0359096002
cough 0.7164202952 0.00796337569 0.7405228672 0.0473180859
My use case is to now transpose the predictors aches and cough into columns and the top level of the indexed columns into rows, such that if the pvalue is greater than 0.05 the cell should be null, and otherwise the cell should contain the value of b. I assume the evaluation of the pvalue will be in a lambda function, but I may be wrong!
The desired dataframe thus would be:
model aches cough
model_a NaN 0.7164202952
model_b 0.3523114106 0.7405228672
To be honest, I have absolutely no idea how to do this, let alone how to begin. Any assistance would be most appreciated.
Try pd.IndexSlice together with where:
out = (df.loc[:, pd.IndexSlice[:, 'b']]
         .where(df.loc[:, pd.IndexSlice[:, 'pvalue']].values < 0.05)
         .T
         .reset_index(level=1, drop=True))
aches cough
ma NaN 0.716420
mb 0.352311 0.740523
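Here is a self-contained sketch of that approach on the sample data from the question (the MultiIndex is rebuilt by hand; when reading the CSV back, something like pd.read_csv(path, header=[0, 1], index_col=0) should restore it, assuming the file has the layout shown above):
import pandas as pd

columns = pd.MultiIndex.from_product([['model_a', 'model_b'], ['b', 'pvalue']],
                                     names=['model', None])
df = pd.DataFrame(
    [[0.6991801946, 0.33372434223, 0.3523114106, 0.0359096002],
     [0.7164202952, 0.00796337569, 0.7405228672, 0.0473180859]],
    index=pd.Index(['aches', 'cough'], name='predictor'),
    columns=columns)

# Keep b only where the corresponding pvalue is below 0.05, then flip the axes.
out = (df.loc[:, pd.IndexSlice[:, 'b']]
         .where(df.loc[:, pd.IndexSlice[:, 'pvalue']].values < 0.05)
         .T
         .reset_index(level=1, drop=True))
print(out)
# predictor     aches     cough
# model
# model_a         NaN  0.716420
# model_b    0.352311  0.740523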
I was wondering how to penalize less represented classes more than other classes when dealing with a really imbalanced dataset (10 classes over about 20000 samples; here is the number of occurrences for each class: [10868, 26, 4797, 26, 8320, 26, 5278, 9412, 4485, 16172]).
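As a side note, here is a small sketch of turning those occurrence counts into per-class weights (inverse-frequency weighting, scaled so the largest class gets weight 1; this is one common choice, not the only one):
import numpy as np

counts = np.array([10868, 26, 4797, 26, 8320, 26, 5278, 9412, 4485, 16172])
weights = counts.max() / counts  # rare classes (26 samples) get a weight of about 622
print(weights.round(2))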
I read about the TensorFlow function weighted_cross_entropy_with_logits (https://www.tensorflow.org/api_docs/python/tf/nn/weighted_cross_entropy_with_logits), but I am not sure I can use it for a multi-label problem.
I found a post that sums up the problem I have perfectly (Neural Network for Imbalanced Multi-Class Multi-Label Classification) and that proposes an idea, but it had no answers and I thought the idea might be good :)
Thank you for your ideas and answers !
First of all, here is my suggestion: you can modify your cost function to use it in a multi-label way. There is code showing how to use softmax cross-entropy in TensorFlow for a multi-label image task.
With that code, you can multiply a weight into each row of the loss calculation. Here is example code for the case where you have a multi-label task (i.e., each image can have two labels):
logits_split = tf.split(axis=1, num_or_size_splits=2, value=logits)
labels_split = tf.split(axis=1, num_or_size_splits=2, value=labels)
weights_split = tf.split(axis=1, num_or_size_splits=2, value=weights)
total = 0.0
for i in range(len(logits_split)):
    temp = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
        logits=logits_split[i], labels=labels_split[i]))
    total += temp * tf.reshape(weights_split[i], [-1])
I think you can just use tf.nn.weighted_cross_entropy_with_logits for multiclass classification.
For example, for 4 classes, where the ratios to the class with the largest number of members are [0.8, 0.5, 0.6, 1], you would just give it a weight vector in the following way:
cross_entropy = tf.nn.weighted_cross_entropy_with_logits(
targets=ground_truth_input, logits=logits,
pos_weight = tf.constant([0.8,0.5,0.6,1]))
So I am not entirely sure that I understand your problem given what you have written. The post you link to writes about multi-label AND multi-class, but that doesn't really make sense given what is written there either. So I will approach this as a multi-class problem where each sample has a single label.
In order to penalize the under-represented classes, you can build a weight tensor based on the labels in the current batch. For a 3-class problem, you could e.g. define the weights as the inverse frequency of the classes, such that if the proportions are [0.1, 0.7, 0.2] for classes 1, 2 and 3 respectively, the weights will be [10, 1.43, 5]. Defining the weight tensor based on the current batch then looks like this:
weight_per_class = tf.constant([10, 1.43, 5])  # shape (num_classes,)
onehot_labels = tf.one_hot(labels, depth=3)  # shape (batch_size, num_classes)
weights = tf.reduce_sum(
    tf.multiply(onehot_labels, weight_per_class), axis=1)  # shape (batch_size,)
reduction = tf.losses.Reduction.MEAN  # this ensures that we get a weighted mean
loss = tf.losses.softmax_cross_entropy(
    onehot_labels=onehot_labels, logits=logits, weights=weights, reduction=reduction)
Using softmax ensures that the classification problem is not 3 independent classifications.
I started writing neural networks with TensorFlow, and there is one problem I seem to face in each of my example projects:
My loss always starts at something like 50 or higher and does not decrease, or if it does, it does so slowly that after all my epochs I do not even get near an acceptable loss.
Things I already tried (which did not affect the result very much):
- tested for overfitting, but in the following example you can see that I have 15000 training and 15000 test datasets and something like 900 neurons
- tested different optimizers and optimizer values
- tried increasing the training data by using the test data as training data as well
- tried increasing and decreasing the batch size
I created the network based on the knowledge from https://youtu.be/vq2nnJ4g6N0
But let us have a look at one of my test projects:
I have a list of names and wanted to predict the gender, so my raw data looks like this:
names=["Maria","Paul","Emilia",...]
genders=["f","m","f",...]
To feed it into the network, I transform the names into an array of char codes (padded with zeros up to a max length of 30) and the gender into a bit array (a small sketch of this transformation follows the arrays below):
names=[[77.,97. ,114.,105.,97. ,0. ,0.,...]
[80.,97. ,117.,108.,0. ,0. ,0.,...]
[69.,109.,105.,108.,105.,97.,0.,...]]
genders=[[1.,0.]
[0.,1.]
[1.,0.]]
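A minimal sketch of that transformation (the helper name encode and the zero padding are assumptions; the question only shows the resulting arrays):
import numpy as np

def encode(names, genders, max_len=30):
    # Pad/truncate each name to max_len char codes; encode gender as a 2-bit one-hot.
    X = np.zeros((len(names), max_len), dtype=np.float32)
    for i, name in enumerate(names):
        codes = [float(ord(c)) for c in name[:max_len]]
        X[i, :len(codes)] = codes
    Y = np.array([[1., 0.] if g == "f" else [0., 1.] for g in genders],
                 dtype=np.float32)
    return X, Y

inputData, outputData = encode(["Maria", "Paul", "Emilia"], ["f", "m", "f"])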
I built the network with 3 hidden layers of shapes [30,20], [20,10], [10,10], and a [10,2] output layer. All hidden layers use ReLU as the activation function. The output layer has a softmax.
# Input Layer
x = tf.placeholder(tf.float32, shape=[None, 30])
y_ = tf.placeholder(tf.float32, shape=[None, 2])
# Hidden Layers
# H1
W1 = tf.Variable(tf.truncated_normal([30, 20], stddev=0.1))
b1 = tf.Variable(tf.zeros([20]))
y1 = tf.nn.relu(tf.matmul(x, W1) + b1)
# H2
W2 = tf.Variable(tf.truncated_normal([20, 10], stddev=0.1))
b2 = tf.Variable(tf.zeros([10]))
y2 = tf.nn.relu(tf.matmul(y1, W2) + b2)
# H3
W3 = tf.Variable(tf.truncated_normal([10, 10], stddev=0.1))
b3 = tf.Variable(tf.zeros([10]))
y3 = tf.nn.relu(tf.matmul(y2, W3) + b3)
# Output Layer
W = tf.Variable(tf.truncated_normal([10, 2], stddev=0.1))
b = tf.Variable(tf.zeros([2]))
y = tf.nn.softmax(tf.matmul(y3, W) + b)
Now the calculation for the loss, accuracy and the training operation:
# Loss
cross_entropy = -tf.reduce_sum(y_*tf.log(y))
# Accuracy
is_correct = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(is_correct, tf.float32))
# Training
train_operation = tf.train.GradientDescentOptimizer(0.01).minimize(cross_entropy)
I train the network in batches of 100
sess = tf.Session()
sess.run(tf.global_variables_initializer())
for i in range(150):
    bs = 100
    index = i * bs
    inputBatch = inputData[index:index+bs]
    outputBatch = outputData[index:index+bs]
    sess.run(train_operation, feed_dict={x: inputBatch, y_: outputBatch})
    accuracyTrain, lossTrain = sess.run([accuracy, cross_entropy],
                                        feed_dict={x: inputBatch, y_: outputBatch})
    if i % (bs/10) == 0:
        print("step %d loss %.2f accuracy %.2f" % (i, lossTrain, accuracyTrain))
And I get the following result:
step 0 loss 68.96 accuracy 0.55
step 10 loss 69.32 accuracy 0.50
step 20 loss 69.31 accuracy 0.50
step 30 loss 69.31 accuracy 0.50
step 40 loss 69.29 accuracy 0.51
step 50 loss 69.90 accuracy 0.53
step 60 loss 68.92 accuracy 0.55
step 70 loss 68.99 accuracy 0.55
step 80 loss 69.49 accuracy 0.49
step 90 loss 69.25 accuracy 0.52
step 100 loss 69.39 accuracy 0.49
step 110 loss 69.32 accuracy 0.47
step 120 loss 67.17 accuracy 0.61
step 130 loss 69.34 accuracy 0.50
step 140 loss 69.33 accuracy 0.47
What am I doing wrong?
Why does it start at ~69 in my project and not lower?
Thank you very much, guys!
There's nothing wrong with 0.69 nats of entropy per sample as a starting point for a binary classification; since your loss is summed over a batch of 100 samples, that comes out to roughly the 69 you are seeing.
If you convert to base 2, 0.69/log(2), you'll see that it's almost exactly 1 bit per sample, which is exactly what you would expect if you're unsure about a binary classification.
I usually use the mean loss instead of the sum so things are less sensitive to batch size.
You should also not calculate the cross-entropy directly yourself, because that method breaks easily; you probably want tf.nn.sigmoid_cross_entropy_with_logits.
I also like starting with the Adam optimizer instead of pure gradient descent.
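Applied to the loss/optimizer part of the code in the question, those suggestions would look roughly like this (a sketch, not a tested rewrite; the 0.001 Adam learning rate is an assumption, and with the one-hot two-class labels used here tf.nn.softmax_cross_entropy_with_logits would be an equally natural choice):
# Keep the output layer as raw logits (drop the softmax); the *_with_logits
# losses apply the nonlinearity internally in a numerically stable way.
logits = tf.matmul(y3, W) + b
# Mean over the batch instead of a sum, so the loss is insensitive to batch size.
cross_entropy = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=y_, logits=logits))
train_operation = tf.train.AdamOptimizer(0.001).minimize(cross_entropy)
# tf.argmax(logits, 1) gives the same predictions as argmax of the softmax,
# so the accuracy calculation works unchanged with y replaced by logits.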
Here are two reasons you might be having some trouble with this problem:
1) Character codes are ordered, but the order doesn't mean anything. Your inputs would be easier for the network to use if they were encoded as one-hot vectors, so each input would be a 26x30 = 780 element vector (see the sketch below). Without that, the network has to waste a bunch of capacity learning the boundaries between letters.
2) You've only got fully connected layers. This makes it impossible for the network to learn a fact independently of its absolute position in the name. 6 of the top 10 girls' names in 2015 ended in 'a', while 0 of the top 10 boys' names did. As currently written, your network needs to re-learn "usually it's a girl's name if it ends in 'a'" independently for each name length. Using some convolutional layers would allow it to learn such facts once across all name lengths.
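As an illustration of point 1, a sketch of one-hot character encoding (assuming a lowercase a-z alphabet and the 30-character padding from the question; characters outside a-z are left as all-zero columns):
import numpy as np

def encode_name_onehot(name, max_len=30):
    # One 26-element column per character position, flattened to 780 values.
    onehot = np.zeros((max_len, 26), dtype=np.float32)
    for i, ch in enumerate(name.lower()[:max_len]):
        idx = ord(ch) - ord('a')
        if 0 <= idx < 26:
            onehot[i, idx] = 1.0
    return onehot.reshape(-1)

x = encode_name_onehot("Maria")
print(x.shape)  # (780,)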
Below is the RNN that I built using Keras:
from keras.models import Sequential
from keras.layers import LSTM, Dense, TimeDistributed, BatchNormalization
from keras.optimizers import RMSprop

def RNN_keras(feat_num, timestep_num=100):
    model = Sequential()
    model.add(BatchNormalization(input_shape=(timestep_num, feat_num)))
    model.add(LSTM(input_shape=(timestep_num, feat_num), output_dim=512,
                   activation='relu', return_sequences=True))
    model.add(BatchNormalization())
    model.add(LSTM(output_dim=128, activation='relu', return_sequences=True))
    model.add(BatchNormalization())
    model.add(TimeDistributed(Dense(output_dim=1, activation='linear')))  # sequence labeling
    rmsprop = RMSprop(lr=0.00001, rho=0.9, epsilon=1e-08)
    model.compile(loss='mean_squared_error',
                  optimizer=rmsprop,
                  metrics=['mean_squared_error'])
    return model
The output is as follows:
61267 in the training set
6808 in the test set
Building training input vectors ...
888 unique feature names
The length of each vector will be 888
Using TensorFlow backend.
Build model...
****** Iterating over each batch of the training data ******
# Each batch has 1280 examples
# The training data are shuffled at the beginning of each epoch.
Epoch 1/3 : Batch 1/48 | loss = 607.043823 | root_mean_squared_error = 24.638334
Epoch 1/3 : Batch 2/48 | loss = 14479824582732.208323 | root_mean_squared_error = 3805236.468701
Epoch 1/3 : Batch 3/48 | loss = nan | root_mean_squared_error = nan
Epoch 1/3 : Batch 4/48 | loss = nan | root_mean_squared_error = nan
Epoch 1/3 : Batch 5/48 | loss = nan | root_mean_squared_error = nan
......
The loss goes very high in the second batch and then becomes nan. The true outcome y does not contain very large values; the max y is less than 400.
On the other hand, when I check the prediction output y_hat, the RNN returns some very large predictions, which overflow to infinity.
However, I am still puzzled about how to improve my model.
The problem is "kind of" solved by 1) changing the activation of the output layer from "linear" to "relu" and/or 2) decreasing the learning rate.
However, the predictions are now all zero.
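For reference, the change described above amounts to something like the following sketch of the model function (only the output activation and the learning rate differ from the code shown earlier; the exact reduced learning rate is an assumption, and this is not offered as a good final fix given the all-zero predictions):
def RNN_keras_v2(feat_num, timestep_num=100):
    model = Sequential()
    model.add(BatchNormalization(input_shape=(timestep_num, feat_num)))
    model.add(LSTM(input_shape=(timestep_num, feat_num), output_dim=512,
                   activation='relu', return_sequences=True))
    model.add(BatchNormalization())
    model.add(LSTM(output_dim=128, activation='relu', return_sequences=True))
    model.add(BatchNormalization())
    # 'relu' on the output clamps negative predictions to 0, which is consistent
    # with the all-zero predictions observed after the change.
    model.add(TimeDistributed(Dense(output_dim=1, activation='relu')))
    rmsprop = RMSprop(lr=1e-6, rho=0.9, epsilon=1e-08)  # assumed reduced learning rate
    model.compile(loss='mean_squared_error', optimizer=rmsprop,
                  metrics=['mean_squared_error'])
    return model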