Prevent exploding loss function multi-step multi-variate/output forecast ConvLSTM - tensorflow

I have a problem that I currently cannot solve. During training, my loss explodes to inf or NaN, because the MSE over all errors becomes huge when the predictions are poor, as they are at the beginning of training. That is the intended and correct behavior of MSE. But which loss function should I use to train a ConvLSTM so that it can learn a multi-step, multi-variate output?
For example, I try to use an input of shape (32, None, 200, 12) to predict an output of shape (32, None, 30, 12): 32 is the batch size, None is the number of samples (>3000), 200 is the number of input time steps, and 12 is the number of features. The output has 30 time steps and the same 12 features.
My ConvLSTM model:
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Conv1D, Dense, Dropout, LSTM, LeakyReLU, Reshape
# `inputs` instead of `input`, to avoid shadowing the Python builtin
inputs = Input(shape=(None, input_shape[1]))
conv1d_1 = Conv1D(filters=64, kernel_size=3, activation=LeakyReLU())(inputs)
conv1d_2 = Conv1D(filters=32, kernel_size=3, activation=LeakyReLU())(conv1d_1)
dropout = Dropout(0.3)(conv1d_2)
lstm_1 = LSTM(32, activation=LeakyReLU())(dropout)
dense_1 = Dense(forecasting_horizon * input_shape[1], activation=LeakyReLU())(lstm_1)
output = Reshape((forecasting_horizon, input_shape[1]))(dense_1)
model = Model(inputs=inputs, outputs=output)
My ds generation:
ds_inputs = tf.keras.utils.timeseries_dataset_from_array(
    df[:-forecast_horizon], None, sequence_length=window_size,
    sequence_stride=1, shuffle=False, batch_size=None)
ds_targets = tf.keras.utils.timeseries_dataset_from_array(
    df[forecast_horizon:], None, sequence_length=forecast_horizon,
    sequence_stride=1, shuffle=False, batch_size=None)
ds_inputs = ds_inputs.batch(batch_size, drop_remainder=True)
ds_targets = ds_targets.batch(batch_size, drop_remainder=True)
ds = tf.data.Dataset.zip((ds_inputs, ds_targets))
ds = ds.shuffle(buffer_size=len(ds))
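One thing worth checking in the dataset code above is how the input and target windows line up. The following is a minimal pure-Python sketch (no TensorFlow), assuming the targets are meant to start immediately after each input window. `make_windows` is a hypothetical helper written for illustration, not a library function.

```python
def make_windows(series, window_size, horizon):
    """Pair each input window with the `horizon` values that directly follow it."""
    pairs = []
    last_start = len(series) - window_size - horizon
    for start in range(last_start + 1):
        x = series[start:start + window_size]
        y = series[start + window_size:start + window_size + horizon]
        pairs.append((x, y))
    return pairs

series = list(range(10))            # toy stand-in for the rows of df
pairs = make_windows(series, window_size=4, horizon=2)
first_x, first_y = pairs[0]
print(first_x, first_y)             # [0, 1, 2, 3] [4, 5]
print(len(pairs))                   # 5
```

If this alignment is what is wanted, note that slicing the target source at `forecast_horizon` (as in `df[forecast_horizon:]`) makes target windows overlap their input windows whenever `forecast_horizon` is smaller than `window_size`.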
Besides MSE, I have already tried MeanAbsoluteError, MeanSquaredLogarithmicError, MeanAbsolutePercentageError, and CosineSimilarity; the last one produces nonsense. MSLE works best, but it does not penalize large errors strongly, so the MSE (used as a metric) varies enormously during training. Additionally, after a while the network stalls and the loss stops improving (my explanation is that the differences in loss become too small on the logarithmic scale, so the weights can no longer be adjusted well).
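To make the difference between MSE and MSLE concrete, here is a small sketch in plain Python. The `msle` function follows the common log(1 + y) definition; the sample values are made up for illustration.

```python
import math

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def msle(y_true, y_pred):
    # Squared difference of log(1 + value), averaged over all elements.
    return sum((math.log1p(t) - math.log1p(p)) ** 2
               for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [10.0, 10.0]
small_miss = [11.0, 9.0]       # off by 1 on each element
large_miss = [1000.0, 10.0]    # one wildly wrong prediction

print(mse(y_true, small_miss), mse(y_true, large_miss))    # 1.0 490050.0
print(msle(y_true, small_miss), msle(y_true, large_miss))
```

The single bad prediction inflates the MSE by several orders of magnitude more than it inflates the MSLE. That is consistent with MSLE training stably while the MSE metric swings widely.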

I can partially answer my own question. One issue is that I used ReLU/LeakyReLU as the LSTM activation, which contributes to the exploding gradient problem: the RNN/LSTM layer applies the same weights at every time step, and because ReLU is unbounded above, the values keep adding up over time and nothing pulls them back down. With tanh as the activation, the outputs are bounded to [-1, 1] and can be negative, which allows the internal state to shrink again and greatly reduces the chance of exploding values and predictions within the network. After some tests, the LSTM layer stays numerically stable.
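The effect described above can be illustrated with a toy scalar recurrence. This is only a sketch: it is not an LSTM (there are no gates), and the weight of 1.5 and the constant input are arbitrary values chosen so that the repeated multiplication is visible.

```python
import math

def leaky_relu(x, alpha=0.3):
    return x if x > 0 else alpha * x

def run_recurrence(activation, weight=1.5, inp=1.0, steps=200):
    """Apply h = activation(weight * h + inp) repeatedly with the same weight."""
    h = 0.0
    for _ in range(steps):
        h = activation(weight * h + inp)
    return h

h_leaky = run_recurrence(leaky_relu)   # unbounded activation: grows geometrically
h_tanh = run_recurrence(math.tanh)     # bounded activation: stays inside [-1, 1]
print(h_leaky)
print(h_tanh)
```

With 200 time steps, as in the question, the unbounded activation lets the state grow far beyond any reasonable range, while tanh keeps it bounded.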


Plateaus training loss but validation loss decreases

I am training a fully convolutional regressor with MobileNet as its backbone. I have already overcome a massive overfitting problem by augmenting the data. However, the training loss seems to get stuck after a couple of epochs. The validation loss, on the other hand, decreases for a reasonable number of epochs and then plateaus. Here are the learning curves.
[Figure: learning curves]
And here is the architecture and compiling parameters.
backbone = keras.applications.ResNet50(
    weights="imagenet",
    include_top=False,  # assumed: needed so the backbone returns a feature map
    input_shape=(shape[0], shape[1], 3),
)
backbone.trainable = True
inputs = layers.Input((shape[0], shape[1], 3))
x = keras.applications.resnet.preprocess_input(inputs)
x = backbone(x)
x = layers.Dropout(0.5)(x)
x = layers.SeparableConv2D(
    N, kernel_size=5, strides=1, activation="relu"
)(x)
outputs = layers.SeparableConv2D(
    N, kernel_size=3, strides=1
)(x)
model = keras.Model(inputs, outputs, name="mobilenet_FullyConv_rect")
model.compile(loss="mse", optimizer=keras.optimizers.Adam(1e-4))
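history = model.fit(train_gen, validation_data=val_gen)  # train_gen is an assumed name; the remaining arguments are cut off in the original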
Any idea why the training loss plateaus sooner than the validation loss would be appreciated, as would any suggestions on how to reach smaller loss values instead of plateauing so early.
I have tried using more complex backbones like ResNet50 and DenseNet169, as I suspected underfitting. However, this did not solve anything, and the validation loss fluctuated massively. I also tried augmenting my data even more (tripling the data instead of doubling it). This did not help much either, which makes me believe that this is actually underfitting, since feeding in more data did not lead to improvements. In short, I am lost between bringing in more data and making my architecture more complex.

My model fit too slow, tringle of val_loss is 90

I have a task to write a neural network with 9 input neurons and 4 output neurons for a multiclass classification problem. I have tried different models, and for all of them:
Drop-out mechanism is used.
Batch normalization is used.
The resulting neural networks all overfit. Precision is below 80%, and I want at least 90%. The median loss is 0.8.
Please, can you suggest which model I should use?
TMS_coefficients.RData file
Part of my code:
(trainX, testX, trainY, testY) = train_test_split(dataset,
values, test_size=0.25, random_state=42)
# neural network model
visible = layers.Input(shape=(9,))
hidden0 = layers.Dense(64, activation="tanh")(visible)
batch0 = layers.BatchNormalization()(hidden0)
drop0 = layers.Dropout(0.3)(batch0)
hidden1 = layers.Dense(32, activation="tanh")(drop0)
batch1 = layers.BatchNormalization()(hidden1)
drop1 = layers.Dropout(0.2)(batch1)
hidden2 = layers.Dense(128, activation="tanh")(drop1)
batch2 = layers.BatchNormalization()(hidden2)
drop2 = layers.Dropout(0.5)(batch2)
hidden3 = layers.Dense(64, activation="tanh")(drop2)
batch3 = layers.BatchNormalization()(hidden3)
output = layers.Dense(4, activation="softmax")(batch3)
model = tf.keras.Model(inputs=visible, outputs=output)
history = model.fit(trainX, trainY, validation_data=(testX, testY), epochs=5000, batch_size=256)
From the loss curve, I can say it is not overfitting at all. In fact, your model is underfitting. Why? Because when you stopped training, the validation loss curve had not flattened yet. That means your model still has the potential to improve if it is trained longer.
A model overfits when the training loss keeps decreasing (or stays the same) while the validation loss steadily increases. This is clearly not the case here.
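The rule of thumb above can be written down as a rough check on the recorded loss curves. This is a heuristic sketch only: `diagnose`, the window of 3 epochs, and the tolerance are illustrative choices, and the loss values are made up.

```python
def diagnose(train_loss, val_loss, window=3, tol=1e-3):
    """Rough heuristic on the last `window` epochs of two loss curves."""
    train_delta = train_loss[-1] - train_loss[-window]
    val_delta = val_loss[-1] - val_loss[-window]
    if val_delta > tol and train_delta <= tol:
        return "overfitting"        # val loss rising while train loss is flat or falling
    if val_delta < -tol:
        return "still improving"    # val loss has not flattened yet
    return "plateau"

still = diagnose([1.0, 0.8, 0.7, 0.65, 0.6], [1.1, 0.95, 0.9, 0.85, 0.8])
over = diagnose([1.0, 0.6, 0.4, 0.3, 0.2], [1.0, 0.8, 0.85, 0.9, 1.0])
print(still, over)                  # still improving overfitting
```

With Keras, the two lists would typically come from `history.history["loss"]` and `history.history["val_loss"]`.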
So, what you can do:
Try training longer.
Add more layers.
Try different activation functions like ReLU instead of tanh.
Use lower dropout (your model is probably struggling to learn with such high dropout values).
Make sure you shuffle your data before the train-test split (if you are using sklearn's train_test_split(), this is done by default). Also check that the test data is similar to the training data and that both go through the same preprocessing steps.
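For the last point, one simple way to check whether the train and test sets are similar is to compare their class proportions. The sketch below uses made-up labels that are sorted by class, to show what can go wrong when a split is taken without shuffling. `class_fractions` is a hypothetical helper.

```python
import random
from collections import Counter

def class_fractions(labels):
    """Fraction of samples per class, keyed by class label."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: counts[cls] / total for cls in sorted(counts)}

# Hypothetical labels sorted by class, as they might appear in an unshuffled file.
labels = [0] * 40 + [1] * 30 + [2] * 20 + [3] * 10

# Split without shuffling: the last 25% contains only classes 2 and 3.
unshuffled_test = labels[-25:]

# Shuffle first, then split.
rng = random.Random(42)
shuffled = labels[:]
rng.shuffle(shuffled)
shuffled_test = shuffled[-25:]

print(class_fractions(unshuffled_test))   # {2: 0.6, 3: 0.4}
print(class_fractions(shuffled_test))
```

If the fractions in the test set differ strongly from those in the training set, validation metrics will not reflect what the model learned.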

Low evaluation accuracy of Resnet in TensorFlow Federated

I implemented a ResNet34 model in the federated image classification tutorial. After 10 rounds the training accuracy can exceed 90%; however, the evaluation accuracy using the last round's state.model is always around 50%.
evaluation = tff.learning.build_federated_evaluation(model_fn)
federated_test_data = make_federated_data(emnist_test, sample_clients)
test_metrics = evaluation(state.model, federated_test_data)
I am very confused about what could be wrong with the evaluation part. Also, I printed the non-trainable variables (mean and variance in BatchNorm) of the server's model, and they are 0 and 1, with no updates or averaging after those rounds. Should they be like that, or could that be the problem?
Thanks very much!
The code to prepare the training data and the printed results:
OrderedDict([('label', TensorSpec(shape=(), dtype=tf.int64, name=None)),('pixels',TensorSpec(shape=(256, 256, 3), dtype=tf.float32, name=None))])
def preprocess(dataset):
    def element_fn(element):
        return collections.OrderedDict([
            ('x', element['pixels']),
            ('y', tf.reshape(element['label'], [1])),
        ])
    # The end of this call is cut off in the original; SHUFFLE_BUFFER and
    # BATCH_SIZE are assumed names, as used in the TFF tutorial.
    return dataset.repeat(NUM_EPOCHS).map(element_fn).shuffle(
        SHUFFLE_BUFFER).batch(BATCH_SIZE)
# Defined before its first use below
def make_federated_data(client_data, client_ids):
    return [preprocess(client_data.create_tf_dataset_for_client(x))
            for x in client_ids]
sample_clients = emnist_train.client_ids[0:NUM_CLIENTS]
federated_train_data = make_federated_data(emnist_train, sample_clients)
preprocessed_example_dataset = preprocess(example_dataset)
sample_batch = tf.nest.map_structure(
    lambda x: x.numpy(), next(iter(preprocessed_example_dataset)))
len(federated_train_data), federated_train_data[0]
(4,<BatchDataset shapes: OrderedDict([(x, (None, 256, 256, 3)), (y, (None, 1))]), types: OrderedDict([(x, tf.float32), (y, tf.int64)])>)
The training and evaluation codes:
def create_compiled_keras_model():
    base_model = tf.keras.applications.resnet.ResNet50(include_top=False, weights='imagenet', input_shape=(256, 256, 3))
    global_average_layer = tf.keras.layers.GlobalAveragePooling2D()
    prediction_layer = tf.keras.layers.Dense(2, activation='softmax')
    # The layer list is cut off in the original; these are the three layers defined above.
    model = tf.keras.Sequential([
        base_model,
        global_average_layer,
        prediction_layer,
    ])
    model.compile(optimizer=tf.keras.optimizers.SGD(lr=0.001, momentum=0.9), loss=tf.keras.losses.SparseCategoricalCrossentropy(), metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])
    return model
def model_fn():
    keras_model = create_compiled_keras_model()
    return tff.learning.from_compiled_keras_model(keras_model, sample_batch)
iterative_process = tff.learning.build_federated_averaging_process(model_fn)
state = iterative_process.initialize()
for round_num in range(2, 12):
    state, metrics = iterative_process.next(state, federated_train_data)
    print('round {:2d}, metrics={}'.format(round_num, metrics))
evaluation = tff.learning.build_federated_evaluation(model_fn)
federated_test_data = make_federated_data(emnist_test, sample_clients)
len(federated_test_data), federated_test_data[0]
(4,<BatchDataset shapes: OrderedDict([(x, (None, 256, 256, 3)), (y, (None, 1))]), types: OrderedDict([(x, tf.float32), (y, tf.int64)])>)
test_metrics = evaluation(state.model, federated_test_data)
The training and evaluation results after each round:
round 1, metrics=<sparse_categorical_accuracy=0.5089045763015747,loss=0.7813001871109009,keras_training_time_client_sum_sec=0.008826255798339844>
round 2, metrics=<sparse_categorical_accuracy=0.519825279712677,loss=0.7640910148620605,keras_training_time_client_sum_sec=0.011750459671020508>
round 3, metrics=<sparse_categorical_accuracy=0.5099126100540161,loss=0.7513422966003418,keras_training_time_client_sum_sec=0.0039823055267333984>
round 4, metrics=<sparse_categorical_accuracy=0.5278897881507874,loss=0.7905193567276001,keras_training_time_client_sum_sec=0.0010638236999511719>
round 5, metrics=<sparse_categorical_accuracy=0.5199933052062988,loss=0.7782396674156189,keras_training_time_client_sum_sec=0.012729644775390625>
There are a few nuances and a few open research problems in Federated Learning and this question has struck a couple of them.
Training loss looks much better than evaluation loss: when using Federated Averaging (the optimization algorithm used in the Federated Learning for Image Classification tutorial), one needs to be careful when interpreting metrics, as they differ in subtle ways from centralized model training. This applies especially to the training loss, which is an average over many sequence steps or batches. After one round, each client may have fit the model to its local data very well (obtaining a high accuracy), but after these updates are averaged into the global model, the global model may still be far from "good", resulting in a low test accuracy. Additionally, 10 rounds may be too few; one of the original academic papers on Federated Learning needed at least 20 rounds to reach 99% accuracy with IID data (McMahan 2016), and more than 100 rounds with non-IID data.
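A deliberately extreme toy example shows how perfect local fits can average into a poor global model. This is a plain-Python sketch of the averaging step only, not TFF code. The two clients, their data, and the one-parameter model y = w * x are all made up for illustration.

```python
def fit_slope(xs, ys):
    """Least-squares slope for the model y = w * x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def mse(w, xs, ys):
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Two hypothetical non-IID clients whose local data follow opposite trends.
xs = [1.0, 2.0, 3.0]
client_a = (xs, [2.0 * x for x in xs])     # y = 2x
client_b = (xs, [-2.0 * x for x in xs])    # y = -2x

w_a = fit_slope(*client_a)
w_b = fit_slope(*client_b)

# Federated averaging step: weight each client's model by its number of examples.
n_a, n_b = len(client_a[0]), len(client_b[0])
w_global = (n_a * w_a + n_b * w_b) / (n_a + n_b)

print(mse(w_a, *client_a), mse(w_b, *client_b))              # local losses: 0.0 0.0
print(mse(w_global, *client_a), mse(w_global, *client_b))    # losses of the averaged model
```

Each client reports a perfect training loss, yet the averaged model fits neither client. Real client data is rarely this adversarial, but the same mechanism can open a gap between training and evaluation metrics.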
BatchNorm in the federated setting: it is an open research problem how to combine the batch norm parameters, particularly with non-IID client data. Should each new client start with fresh parameters, or receive the global model parameters? TFF may not be communicating them between the server and the clients (since it is currently implemented to communicate only trainable variables), which may lead to unexpected behavior. It may be good to print the state parameters and watch what happens to them each round.
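Why stale moving statistics matter can be seen in a small plain-Python sketch of batch normalization. The activations are made up. The initial moving mean of 0, the initial moving variance of 1, and the epsilon of 1e-3 match the Keras BatchNormalization defaults. The learned scale and offset are omitted for brevity.

```python
import math

def batch_norm(values, mean, var, eps=1e-3):
    """Normalize values with the given statistics (scale and offset omitted)."""
    return [(v - mean) / math.sqrt(var + eps) for v in values]

activations = [3.0, 5.0, 7.0, 5.0]    # hypothetical pre-normalization activations

# Training mode: normalize with the statistics of the current batch.
batch_mean = sum(activations) / len(activations)
batch_var = sum((v - batch_mean) ** 2 for v in activations) / len(activations)
train_out = batch_norm(activations, batch_mean, batch_var)

# Inference mode with moving statistics still at their initial values (0 and 1).
eval_out = batch_norm(activations, mean=0.0, var=1.0)

print(sum(train_out) / len(train_out))   # centered around 0
print(sum(eval_out) / len(eval_out))     # close to 5: not what later layers saw in training
```

If the server model's moving mean and variance really stay at 0 and 1, the layers after each BatchNorm receive very different inputs at evaluation time than during training. That could explain an evaluation accuracy near chance.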
I found that initialization is the reason why ResNet performs poorly. Possibly TFF uses a relatively simple state initialization that does not account for some layers such as batch norm. When I assigned the normal Keras model's initial weights to the server instead of using the default initialization, the federated results were much better.

Why is my convolutional Neural Network stuck in a local minimum?

I've heard that machine learning algorithms rarely get stuck in local minima, but my CNN (in TensorFlow) is predicting a constant output for all values, and I am using a mean squared error loss function, so I think this must be a local minimum given the properties of MSE. I have a network with 2 convolution layers and 1 dense layer (+1 dense output layer for regression) with 24, 32 and 100 neurons respectively, but I've tried changing the numbers of layers/neurons and the issue is not solved. I have ReLU activations for the hidden layers and absolute value on the output layer (I know this is uncommon, but it converges faster to a lower MSE than the softplus function, which still has the same problem, and I need strictly positive outputs). I also have a 50% dropout layer between the dense and output layers and a pooling layer between the 2 convolutions. I have also tried changing the learning rate (currently 0.0001) and batch size. I am using an Adam optimizer.
I have seen it suggested to change/add bias, but I'm not sure how to initialize it in tf.layers.conv2d/tf.layers.dense (for which I have use_bias=True), and I can't see any options for bias with tf.nn.conv2d, which I used for my first layer so I could initialize the kernel easily.
Any suggestions would be really appreciated, thanks.
Here's the section of my code with the network:
filter_shape = [3, 3, 12, 24]
def nn_model(input):
    weights = tf.Variable(tf.truncated_normal(filter_shape, mean=10,
                                              stddev=3), name='weights')
    conv1 = tf.nn.conv2d(input, weights, [1, 1, 1, 1], padding='SAME')
    conv2 = tf.layers.conv2d(inputs=conv1, filters=32, kernel_size=[3, 3],
                             padding="same", activation=tf.nn.relu)
    # The end of this call is cut off in the original; padding="same" is assumed
    # because it gives the 3x3 maps that the reshape below expects for 5x5 inputs.
    pool = tf.layers.max_pooling2d(inputs=conv2, pool_size=[2, 2], strides=2,
                                   padding="same")
    flat = tf.reshape(pool, [-1, 32 * 3 * 3])
    dense_3 = tf.layers.dense(flat, neurons, activation=tf.nn.relu)
    dropout_2 = tf.layers.dropout(dense_3, rate=rate)
    prediction = tf.layers.dense(dropout_2, 1, activation=tf.nn.softplus)
    return prediction
My inputs are 5x5 images with 12 channels of environmental data and I have ~100,000 training samples. My current MSE is ~90 on values of ~25.
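A useful reference point for a network that predicts a constant: among all constant predictions, the mean of the targets minimizes MSE, and the resulting MSE equals the variance of the targets. The sketch below uses made-up targets to show the calculation.

```python
def mean(values):
    return sum(values) / len(values)

def mse_of_constant(targets, constant):
    return mean([(t - constant) ** 2 for t in targets])

targets = [10.0, 20.0, 30.0, 40.0]      # hypothetical regression targets
best_constant = mean(targets)            # the mean minimizes MSE among constants
baseline = mse_of_constant(targets, best_constant)   # equals the variance of the targets

print(best_constant, baseline)           # 25.0 125.0
```

Comparing the reported MSE of ~90 with the variance of the actual training targets would show whether the network has learned anything beyond the mean.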
I used to face the same problem with bigger images. I increased the number of convolution layers to solve it. Maybe you should try adding even more convolution layers.
In my opinion, the problem comes from the fact that you don't have enough parameters and thus get stuck in a local minimum. Increasing the number of parameters can help the updates converge to a better minimum.
Also, I can't see the optimizer you are using in the code. Is it Adam? You can try starting with a larger learning rate and using a decay to decrease it epoch after epoch.
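The learning-rate suggestion can be sketched as a simple exponential schedule. The starting value of 1e-3 (ten times the 1e-4 from the question) and the decay rate of 0.9 are illustrative assumptions, not tuned values.

```python
def exponential_decay(initial_lr, decay_rate, epoch):
    """Learning rate after `epoch` epochs: initial_lr * decay_rate ** epoch."""
    return initial_lr * decay_rate ** epoch

# Hypothetical schedule for the first five epochs.
schedule = [exponential_decay(1e-3, 0.9, epoch) for epoch in range(5)]
print(schedule)
```

In TensorFlow 1.x, `tf.train.exponential_decay` provides a per-step version of the same idea. Its output can be passed as the learning rate of the optimizer.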

Keras LSTM always underfits

I am trying to train an LSTM with Keras and Tensorflow backend but it seems to always underfit; the loss and validation loss curves have an initial drop and then flatten out very fast (see image). I have tried adding more layers, more neurons, no dropout, etc., but can't get it even anywhere near an overfit and I do have a good bit of data (almost 4 hours with 100 samples per second, and I have tried downsampling to 50/sec).
My problem is multidimensional time series prediction with continuous values.
Any ideas would be appreciated!
Here is my basic keras architecture:
data_dim = 30 #input dimensions => each timestep has 30 features
timesteps = 200
out_dim = 30 #output dimensions => each predicted output timestep
# has 30 dimensions
batch_size = 50
num_epochs = 300
learning_rate = 0.0005 #tried values between around 0.001 and 0.0003
#hidden layers size
h1 = 120
h2 = 340
h3 = 340
h4 = 120
model = Sequential()
model.add(LSTM(h1, return_sequences=True,input_shape=(timesteps, data_dim)))
model.add(LSTM(h2, return_sequences=True))
model.add(LSTM(h3, return_sequences=True))
model.add(LSTM(h4, return_sequences=True))
model.add(Dense(out_dim, activation='linear'))
rmsprop_otim = keras.optimizers.RMSprop(lr=learning_rate, rho=0.9, epsilon=1e-08, decay=decay)
model.compile(loss='mean_squared_error', optimizer=rmsprop_otim,metrics=['mse'])
#data preparation
[x_train, y_train] = readData()
x_train = x_train.reshape((int(num_samples/timesteps),timesteps,data_dim))
y_train = y_train.reshape((int(num_samples/timesteps),timesteps,num_classes))
history_callback = model.fit(x_train, y_train, validation_split=0.1,
    batch_size=batch_size, epochs=num_epochs, shuffle=False, callbacks=[checkpointer, losses])
When you say an MSE of 0.06 is underfitting, that depends a lot on the data distribution. MSE is a relative measure, so if the data is not normalized, 0.06 might even be an overfit. In that case, preprocessing might help. Also, check whether there is significant noise in the data.
Using 4 LSTM layers with large sizes means a lot of parameters to learn. Fewer layers might be enough.
Try non-linear activation in the final layer.
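To illustrate the point above about MSE being relative to the data scale, here is a plain-Python sketch that standardizes a feature and expresses an MSE of 0.06 as a fraction of the data's variance. The raw values are made up, and `standardize` is a hypothetical helper.

```python
import math

def standardize(values):
    """Rescale to zero mean and unit variance; also return the statistics used."""
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    return [(v - mu) / sigma for v in values], mu, sigma

# Hypothetical feature with a small spread.
raw = [0.1, 0.3, 0.5, 0.7, 0.9]
scaled, mu, sigma = standardize(raw)

variance = sigma ** 2
print(variance)            # spread of the raw data, about 0.08
print(0.06 / variance)     # an MSE of 0.06 as a fraction of that variance, about 0.75
```

On data like this, an MSE of 0.06 is barely better than always predicting the mean. On data with a variance of 100, the same number would be an excellent fit. Standardizing the inputs and targets makes the loss directly comparable to 1.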
I suspect that your model only learns the weights of the Dense layer properly, but not those of the LSTM layers below it. As a quick check, what kind of performance do you get when you remove all LSTM layers and replace the Dense layer with a TimeDistributed(Dense...) layer? If your graphs look the same, training doesn't work, i.e. the error gradients with respect to the lower layer weights may be too small. Another way of checking this is to inspect the gradients directly and/or to compare the final weights after training with the initial weights. If this is indeed the problem, you can try the following: 1) standardize your inputs, 2) use a smaller learning rate (decrease it logarithmically), 3) add skip-layer connections, 4) use Adam instead of RMSprop, and 5) train for more epochs.
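The weight comparison mentioned above can be sketched as a relative-change measure per layer. The weight values and layer names below are made up. In Keras, the real values could be collected with `model.get_weights()` before and after training and flattened.

```python
import math

def relative_change(initial, final):
    """||final - initial|| / ||initial|| for one flattened weight tensor."""
    diff = math.sqrt(sum((f - i) ** 2 for i, f in zip(initial, final)))
    norm = math.sqrt(sum(i ** 2 for i in initial))
    return diff / norm

# Hypothetical flattened weights before and after training.
initial_weights = {"lstm_1": [0.5, -0.5, 0.5, -0.5], "dense": [0.1, 0.2, -0.1, 0.3]}
final_weights = {"lstm_1": [0.5, -0.5, 0.5, -0.49], "dense": [0.6, -0.4, 0.5, 0.9]}

for name in initial_weights:
    change = relative_change(initial_weights[name], final_weights[name])
    print(name, round(change, 3))
```

A layer whose relative change stays near zero while the output layer moves a lot is a sign that little gradient is reaching it.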