RNN loss becomes NaN due to very large prediction values - optimization

Below is the RNN that I built using Keras:
from keras.models import Sequential
from keras.layers import BatchNormalization, LSTM, Dense, TimeDistributed
from keras.optimizers import RMSprop

def RNN_keras(feat_num, timestep_num=100):
    model = Sequential()
    model.add(BatchNormalization(input_shape=(timestep_num, feat_num)))
    model.add(LSTM(input_shape=(timestep_num, feat_num), output_dim=512, activation='relu', return_sequences=True))
    model.add(BatchNormalization())
    model.add(LSTM(output_dim=128, activation='relu', return_sequences=True))
    model.add(BatchNormalization())
    model.add(TimeDistributed(Dense(output_dim=1, activation='linear')))  # sequence labeling
    rmsprop = RMSprop(lr=0.00001, rho=0.9, epsilon=1e-08)
    model.compile(loss='mean_squared_error',
                  optimizer=rmsprop,
                  metrics=['mean_squared_error'])
    return model
The output is as follows:
61267 in the training set
6808 in the test set
Building training input vectors ...
888 unique feature names
The length of each vector will be 888
Using TensorFlow backend.
Build model...
****** Iterating over each batch of the training data ******
# Each batch has 1280 examples
# The training data are shuffled at the beginning of each epoch.
Epoch 1/3 : Batch 1/48 | loss = 607.043823 | root_mean_squared_error = 24.638334
Epoch 1/3 : Batch 2/48 | loss = 14479824582732.208323 | root_mean_squared_error = 3805236.468701
Epoch 1/3 : Batch 3/48 | loss = nan | root_mean_squared_error = nan
Epoch 1/3 : Batch 4/48 | loss = nan | root_mean_squared_error = nan
Epoch 1/3 : Batch 5/48 | loss = nan | root_mean_squared_error = nan
......
The loss goes very high in the second batch and then becomes nan. The true outcome y does not contain very large values; the max y is less than 400.
On the other hand, I checked the prediction output y_hat. The RNN returns some very large predictions, which lead to infinity and then nan.
However, I am still puzzled about how to improve my model.

The problem is "kind of" solved by 1) changing the activation of the output layer from "linear" to "relu" and/or 2) decreasing the learning rate.
However, the predictions are now all zero.
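For reference, here is a minimal sketch of those two changes applied to the model above. The exact smaller learning rate and the clipvalue argument are assumptions on my part; gradient clipping is a commonly used safeguard against exploding RNN gradients, not something tried in this post:
model.add(TimeDistributed(Dense(output_dim=1, activation='relu')))   # was activation='linear'
rmsprop = RMSprop(lr=1e-6, rho=0.9, epsilon=1e-08, clipvalue=1.0)    # smaller lr; clipvalue is an assumption
model.compile(loss='mean_squared_error', optimizer=rmsprop, metrics=['mean_squared_error'])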

Related

Is there a way for a Keras model to predict with around 14-17 significant digits of precision?

I've created a simple Keras model (prediction, not classification or regression):
inpLayer = tf.keras.layers.Input((1,), dtype=tf.float64)
hiddenLayer = tf.keras.layers.Dense(2, activation=tf.atan, dtype=tf.float64)(inpLayer)
outputLayer = tf.keras.layers.Dense(1, activation='linear', dtype=tf.float64)(hiddenLayer)
model = tf.keras.models.Model(inpLayer, outputLayer)
with the Adam optimizer, mean squared error as the loss function, and full-batch (offline) gradient descent. I have a small dataset (2 "rows"):
x = [35., 49.]
y = [0., 1.]
Am I able to train this simple neural network on those features and labels? By "train" I mean: when the model predicts, it returns output close to the labels using the whole float64 range. So the output should have 15 to 17 significant decimal digits of precision.
Acceptable: output for 0 would be in the range 9.9e-14 to 1.0e-17.
Output for 1 would be within about 1.0e-14 to 9.9e-17 of 1.
Is it even possible to train a neural network like that, or is it not?
If so, what am I doing wrong here?
Normalize the data, for example by using a normalization layer. You could also increase the number of neurons to get good results faster.
import tensorflow as tf
x = [[35.], [49.]]
y = [[0.], [1.]]
normalization_layer = tf.keras.layers.Normalization()
normalization_layer.adapt(x)
inpLayer = tf.keras.layers.Input((1,), dtype=tf.float64)
norm = normalization_layer(inpLayer)
hiddenLayer = tf.keras.layers.Dense(2, activation=tf.atan, dtype=tf.float64)(norm)
outputLayer = tf.keras.layers.Dense(1, activation='linear', dtype=tf.float64)(hiddenLayer)
model = tf.keras.models.Model(inpLayer, outputLayer)
model.compile(loss='mse', optimizer='adam')
model.fit(x, y, epochs=5000, batch_size=2, verbose=1)
model.evaluate([[35.], [49.]], [[0.], [1.]])
a,b = model.predict([[35.], [49.]])
print(a.astype(str), b.astype(str))
Output:
Epoch 5000/5000
1/1 [==============================] - 0s 2ms/step - loss: 1.1910e-30
1/1 [==============================] - 0s 112ms/step - loss: 1.1910e-30
['9.43689570931383e-16'] ['0.9999999999999988']

Discrepancy between results reported by TensorFlow model.evaluate and model.predict

I've been back and forth with this for ages but haven't been able to find a solution anywhere. So, I have a HuggingFace model ('bert-base-cased') that I'm using with TensorFlow and a custom dataset. I've: (1) tokenized my data; (2) split the data; (3) converted the data to TF dataset format; (4) instantiated, compiled, and fit the model.
During training, it behaves as you'd expect: training and validation accuracy go up. But when I evaluate the model on the test dataset using TF's model.evaluate and model.predict, the results are very different. The accuracy as reported by model.evaluate is higher (and more or less in line with the validation accuracy); the accuracy as reported by model.predict is about 10% lower. (Maybe it's just a coincidence, but it's similar to the reported training accuracy after the single epoch of fine-tuning.)
Can anyone figure out what's causing this? I include snippets of my code below.
# tokenize the dataset
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path="bert-base-cased",use_fast=False)
def tokenize_function(examples):
    return tokenizer(examples['text'], padding="max_length", truncation=True)
tokenized_datasets = dataset.map(tokenize_function, batched=True)
# splitting dataset
trainSize = 0.7
valTestSize = 1 - trainSize
train_testvalid = tokenized_datasets.train_test_split(test_size=valTestSize,stratify_by_column='class')
valid_test = train_testvalid['test'].train_test_split(test_size=0.5,stratify_by_column='class')
# renaming each of the datasets for convenience
train_set = train_testvalid['train']
val_set = valid_test['train']
test_set = valid_test['test']
# converting the tokenized datasets to TensorFlow datasets
data_collator = DefaultDataCollator(return_tensors="tf")
tf_train_dataset = train_set.to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=['class'],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8)
tf_validation_dataset = val_set.to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=['class'],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=8)
tf_test_dataset = test_set.to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=['class'],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=8)
# loading tensorflow model
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=1)
# compiling the model
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-6),
    loss=tf.keras.losses.BinaryCrossentropy(),
    metrics=tf.metrics.BinaryAccuracy())
# fitting model
history = model.fit(tf_train_dataset,
                    validation_data=tf_validation_dataset,
                    epochs=1)
# Evaluating the model on the test data using `evaluate`
results = model.evaluate(x=tf_test_dataset,verbose=2) # reports binary_accuracy: 0.9152
# first attempt at using model.predict method
hits = 0
misses = 0
for x, y in tf_test_dataset:
    logits = tf.keras.backend.get_value(model(x, training=False).logits)
    labels = tf.keras.backend.get_value(y)
    for i in range(len(logits)):
        if logits[i][0] < 0:
            z = 0
        else:
            z = 1
        if z == labels[i]:
            hits += 1
        else:
            misses += 1
print(hits/(hits+misses)) # reports binary_accuracy: 0.8187
# second attempt at using model.predict method
modelPredictions = model.predict(tf_test_dataset).logits
testDataLabels = np.concatenate([y for x, y in tf_test_dataset], axis=0)
hits = 0
misses = 0
for i in range(len(modelPredictions)):
    if modelPredictions[i][0] >= 0:
        z = 1
    else:
        z = 0
    if z == testDataLabels[i]:
        hits += 1
    else:
        misses += 1
print(hits/(hits+misses)) # reports binary_accuracy: 0.8187
Things I've tried include:
different loss functions (it's a binary classification problem with the label column of the dataset filled with either a zero or a one for each row);
different ways of unpacking the test dataset and feeding it to model.predict;
altering the 'num_labels' parameter between 1 and 2.
I fixed the problem by changing the num_labels parameter to two and the loss function to sparse categorical cross entropy. (I then had to change my model.predict loop by taking the argmax of the two logits produced by the model.)
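For reference, a minimal sketch of that fix, assuming the loss is applied to the raw logits (from_logits=True, since the HuggingFace TF model returns logits) and the metric is switched to sparse categorical accuracy:
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-6),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])
history = model.fit(tf_train_dataset, validation_data=tf_validation_dataset, epochs=1)
# prediction: take the argmax over the two logits per example
modelPredictions = model.predict(tf_test_dataset).logits
predictedLabels = np.argmax(modelPredictions, axis=1)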

LSTM model for time series forecasting does not train properly for some data

CONTEXT
I have a dataframe of monthly historical prices of market indices like so (all data comes from Bloomberg):
            MSCI World   S&P 500   ...   HFRX Event Driven   Gold Spot
1969-12-31         100     92.06   ...                 NaN         NaN
1970-01-30       94.25     85.02   ...                 NaN         NaN
...                ...       ...   ...                 ...         ...
2021-07-31     3141.35   4395.26   ...           20459.292      143.77
2021-08-31      3006.6   4522.68   ...           20614.276      134.06
I want to predict the value of each index for the next month with an LSTM NN (each index has its specially trained NN).
So a new LSTM model is initialized and trained on each of these time series (which all have from 300 to 1200 samples). This (PyTorch) LSTM model is the following:
class LSTMRegressor(nn.Module):
    def __init__(self, input_size, hidden_size, sequence_size, num_layers, dropout):
        super(LSTMRegressor, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.sequence_size = sequence_size
        self.num_layers = num_layers
        self.dropout = dropout
        self.lstm = nn.LSTM(
            input_size=self.input_size,
            hidden_size=self.hidden_size,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout)
        self.linear = nn.Linear(in_features=hidden_size, out_features=1)

    def forward(self, x):
        lstm_out, self.hidden = self.lstm(x)
        y_pred = self.linear(lstm_out[:, -1, :])
        return y_pred
Loss function and optimizer:
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(),lr=learning_rate)
My parameters are the following:
input_size = 1
hidden_size=150
num_layers=2
dropout=0
batch_size = 16
learning_rate = 0.001
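The training loop itself is not shown in the question; below is a minimal sketch of how these pieces would typically fit together. The sequence_size value, the number of epochs, and a train_loader yielding (batch_size, sequence_size, input_size) windows are assumptions, not part of the question:
import torch
import torch.nn as nn

sequence_size = 12   # assumed look-back window in months; not specified in the question
num_epochs = 200     # assumed; the question relies on early stopping instead

model = LSTMRegressor(input_size, hidden_size, sequence_size, num_layers, dropout)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

for epoch in range(num_epochs):
    for x_batch, y_batch in train_loader:    # x_batch: (batch_size, sequence_size, input_size)
        optimizer.zero_grad()
        y_pred = model(x_batch)              # (batch_size, 1)
        loss = criterion(y_pred, y_batch)
        loss.backward()
        optimizer.step()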
RESULTS
For most of the indexes, training seems to work well, with only about a 0.5% mean error on the test set (see an example in the first graph below). However, for some of the indexes, training does not work (about 100% error) (see an example in the second graph below).
The graphs show training/validation loss and MAPE (mean absolute percentage error). The vertical red line simply marks the best epoch as determined by an early-stopping algorithm.
Model that trained successfully (test == validation):
Model that trained unsuccessfully (test == validation):
QUESTIONS
Why do none of the LSTM models seem to overfit (I've tested with tens of thousands of epochs)?
Why do some LSTM models not train properly? (They are not the ones with the least data.)
Why do the models that do not train properly have such smooth curves?
Thank you very much for your help!

How to Structure Three-Dimensional Lag TimeSteps for an LSTM in Keras?

I understand LSTMs require a three-dimensional dataset following the format N_samples x TimeSteps x Variables. I want to restructure my data from a single timestep for all of my rows into lagged timesteps by hour. The idea is that the LSTM would then batch train from hour to hour (from 310033 rows x 1 Timestep x 83 Variables to 310033 rows x 60 TimeSteps x 83 Variables).
However, the losses of my model were weird (training loss increasing with epochs) and training accuracy decreased going from the single timestep to the lagged timesteps. This makes me believe I did this transformation wrong. Is this the correct way to restructure the data, or is there a better way to do so?
The data is time series data in 1 sec recordings and has already been preprocessed to be within a range of 0-1, One-Hot encoded, cleaned, etc...
Current Transformation in Python:
X_train, X_test, y_train, y_test = train_test_split(scaled, target, train_size=.7, shuffle=False)
#reshape input to be 3D [samples, timesteps, features]
#X_train = X_train.reshape((X_train.shape[0], 1, X_train.shape[1])) - Old method for 1 timestep
#X_test = X_test.reshape((X_test.shape[0], 1, X_test.shape[1])) - Old method for 1 timestep

#Generate lag time steps 3D framework for LSTM
#As required for LSTM networks, we must reshape the input data into N_samples x TimeSteps x Variables
hours = len(X_train)/3600
hours = math.floor(hours) #Most 60 min hours available in subset of data
temp = []
# Pull hours into the three-dimensional field
for hr in range(hours, len(X_train) + hours):
    temp.append(scaled[hr - hours:hr, 0:scaled.shape[1]])
X_train = np.array(temp) #Export train features

hours = len(X_test)/3600
hours = math.floor(hours) #Most 60 min hours available in subset of data
temp = []
# Pull hours into the three-dimensional field
for hr in range(hours, len(X_test) + hours):
    temp.append(scaled[hr - hours:hr, 0:scaled.shape[1]])
X_test = np.array(temp) #Export test features
Data Shape after Transformation:
Model Injection:
model.add(LSTM(128, return_sequences=True,
               input_shape=(X_train.shape[1], X_train.shape[2])))
model.add(Dropout(0.15)) #15% dropout layer
#model.add(BatchNormalization())
#Layer 2
model.add(LSTM(128, return_sequences=False))
model.add(Dropout(0.15)) #15% dropout layer
#Layer 3 - return a single vector
model.add(Dense(32))
#Output of 2 because we have 2 classes
model.add(Dense(2, activation='sigmoid'))
# Define optimiser
opt = tf.keras.optimizers.Adam(learning_rate=1e-5, decay=1e-6)
# Compile model
model.compile(loss='sparse_categorical_crossentropy', # Mean Square Error Loss = 'mse'; Mean Absolute Error = 'mae'; sparse_categorical_crossentropy
              optimizer=opt,
              metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=epoch, batch_size=batch, validation_data=(X_test, y_test), verbose=2, shuffle=False)
Any input on how to improve performance or fix the Lag Timesteps?
Since you are trying to predict y against lagged and current values of the x variables, your y_train needs to start after the first set of lagged values, i.e. y_train needs to be y_train[59:]. Your X_train also needs to end within the training period, and the last observation of y_train should correspond to the X_train window whose latest time point is the same as that of y_train. So take X_train of shape [y_train[59:].shape[0], 60, 83].
To elaborate a bit more, you need to fit:
X(t), X(t-1), X(t-2), ..., X(t-59) ---- > y(t)
X(t+1), X(t), X(t-1),..., X(t-58) ------> y(t+1)
The code you have written, if I am not wrong, is probably fitting the opposite:
X(t), X(t-1), X(t-2), ..., X(t-59) ---- > y(t-59)
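A minimal sketch of the alignment described above, assuming scaled is the (n_rows x 83) feature matrix and target is the label array aligned row-for-row with it (both names taken from the question's code), with a window of 60 lagged steps so that X[t-59..t] predicts y[t]:
import numpy as np

window = 60
X_windows, y_aligned = [], []
for t in range(window - 1, len(scaled)):
    X_windows.append(scaled[t - window + 1:t + 1, :])   # rows t-59 .. t, all 83 features
    y_aligned.append(target[t])                         # label at time t, the latest row in the window
X_windows = np.array(X_windows)   # shape: (n_rows - 59, 60, 83)
y_aligned = np.array(y_aligned)   # shape: (n_rows - 59,)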

TensorFlow - loss starts high and does not decrease

I started writing neural networks with TensorFlow, and there is one problem I seem to face in each of my example projects.
My loss always starts at something like 50 or higher and does not decrease, or if it does, it does so slowly that after all my epochs I do not even get near an acceptable loss.
Things I already tried (which did not affect the result very much):
tested for overfitting, but in the following example you can see that I have 15000 training and 15000 test samples and something like 900 neurons
tested different optimizers and optimizer values
tried increasing the training data by using the test data as training data as well
tried increasing and decreasing the batch size
I created the network based on the knowledge from https://youtu.be/vq2nnJ4g6N0
But let us have a look at one of my test projects:
I have a list of names and want to predict the gender, so my raw data looks like this:
names=["Maria","Paul","Emilia",...]
genders=["f","m","f",...]
To feed it into the network, I transform the names into arrays of character codes (padded with zeros to a max length of 30) and the genders into bit arrays:
names=[[77.,97. ,114.,105.,97. ,0. ,0.,...]
[80.,97. ,117.,108.,0. ,0. ,0.,...]
[69.,109.,105.,108.,105.,97.,0.,...]]
genders=[[1.,0.]
[0.,1.]
[1.,0.]]
I built the network with 3 hidden layers [30,20],[20,10],[10,10] and [10,2] for the output layer. All hidden layers have a ReLU as activation function. The output layer has a softmax.
# Input Layer
x = tf.placeholder(tf.float32, shape=[None, 30])
y_ = tf.placeholder(tf.float32, shape=[None, 2])
# Hidden Layers
# H1
W1 = tf.Variable(tf.truncated_normal([30, 20], stddev=0.1))
b1 = tf.Variable(tf.zeros([20]))
y1 = tf.nn.relu(tf.matmul(x, W1) + b1)
# H2
W2 = tf.Variable(tf.truncated_normal([20, 10], stddev=0.1))
b2 = tf.Variable(tf.zeros([10]))
y2 = tf.nn.relu(tf.matmul(y1, W2) + b2)
# H3
W3 = tf.Variable(tf.truncated_normal([10, 10], stddev=0.1))
b3 = tf.Variable(tf.zeros([10]))
y3 = tf.nn.relu(tf.matmul(y2, W3) + b3)
# Output Layer
W = tf.Variable(tf.truncated_normal([10, 2], stddev=0.1))
b = tf.Variable(tf.zeros([2]))
y = tf.nn.softmax(tf.matmul(y3, W) + b)
Now the calculation for the loss, accuracy and the training operation:
# Loss
cross_entropy = -tf.reduce_sum(y_*tf.log(y))
# Accuracy
is_correct = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(is_correct, tf.float32))
# Training
train_operation = tf.train.GradientDescentOptimizer(0.01).minimize(cross_entropy)
I train the network in batches of 100
sess = tf.Session()
sess.run(tf.global_variables_initializer())
for i in range(150):
    bs = 100
    index = i*bs
    inputBatch = inputData[index:index+bs]
    outputBatch = outputData[index:index+bs]
    sess.run(train_operation, feed_dict={x: inputBatch, y_: outputBatch})
    accuracyTrain, lossTrain = sess.run([accuracy, cross_entropy], feed_dict={x: inputBatch, y_: outputBatch})
    if i%(bs/10) == 0:
        print("step %d loss %.2f accuracy %.2f" % (i, lossTrain, accuracyTrain))
And I get the following result:
step 0 loss 68.96 accuracy 0.55
step 10 loss 69.32 accuracy 0.50
step 20 loss 69.31 accuracy 0.50
step 30 loss 69.31 accuracy 0.50
step 40 loss 69.29 accuracy 0.51
step 50 loss 69.90 accuracy 0.53
step 60 loss 68.92 accuracy 0.55
step 70 loss 68.99 accuracy 0.55
step 80 loss 69.49 accuracy 0.49
step 90 loss 69.25 accuracy 0.52
step 100 loss 69.39 accuracy 0.49
step 110 loss 69.32 accuracy 0.47
step 120 loss 67.17 accuracy 0.61
step 130 loss 69.34 accuracy 0.50
step 140 loss 69.33 accuracy 0.47
What am I doing wrong?
Why does it start at ~69 in my project and not lower?
Thank you very much, guys!
There's nothing wrong with 0.69 nats of entropy per sample as a starting point for a binary classification.
If you convert to base 2 (0.69/log(2)), you'll see that it's almost exactly 1 bit per sample, which is exactly what you would expect if you're unsure about a binary classification.
I usually use the mean loss instead of the sum so things are less sensitive to batch size.
You should also not calculate the entropy directly yourself, because that method breaks easily; you probably want tf.nn.sigmoid_cross_entropy_with_logits.
I also like starting with the Adam optimizer instead of pure gradient descent.
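A minimal sketch of those three suggestions applied to the graph above; since the labels here are 2-class one-hot vectors, I use the softmax variant of the suggested op, which is my adaptation rather than part of the original answer:
# Output layer: keep the raw logits; apply softmax only when making predictions
logits = tf.matmul(y3, W) + b
y = tf.nn.softmax(logits)
# Mean cross-entropy computed from the logits by the built-in op
cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=logits))
# Adam instead of plain gradient descent
train_operation = tf.train.AdamOptimizer(0.001).minimize(cross_entropy)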
Here are two reasons you might be having some trouble with this problem:
1) Character codes are ordered, but the order doesn't mean anything. Your inputs would be easier for the network to digest if they were encoded as one-hot vectors, so each input would be a 26x30 = 780 element vector (see the sketch after these points). Without that, the network has to waste a bunch of capacity learning the boundaries between letters.
2) You've only got fully connected layers. This makes it impossible for the network to learn a fact independent of its absolute position in the name. 6 of the top 10 girls' names in 2015 ended in 'a', while 0 of the top 10 boys' names did. As currently written, your network needs to re-learn "usually it's a girl's name if it ends in 'a'" independently for each name length. Using some convolutional layers would allow it to learn such facts once across all name lengths.
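A minimal sketch of the one-hot encoding suggested in point 1, assuming lowercase a-z only, a fixed max length of 30, and zero-padding everywhere else (the helper name is mine, not from the answer):
import numpy as np

def one_hot_name(name, max_len=30, alphabet_size=26):
    # One row per character slot, one column per letter: 30 x 26 = 780 values in total
    vec = np.zeros((max_len, alphabet_size), dtype=np.float32)
    for i, ch in enumerate(name.lower()[:max_len]):
        idx = ord(ch) - ord('a')
        if 0 <= idx < alphabet_size:
            vec[i, idx] = 1.0
    return vec.reshape(-1)   # flatten to a 780-element vector for the fully connected network

names_encoded = np.stack([one_hot_name(n) for n in ["Maria", "Paul", "Emilia"]])
print(names_encoded.shape)   # (3, 780)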