I'm building a model that predicts a person's age from an image of their face. I'm using this pretrained model, and I wrote a custom loss function and a custom metric. The results are decent, but I want to improve them. In particular, I noticed that after some epochs the model begins to overfit the training set and val_loss increases. How can I avoid this? I'm already using Dropout, but it doesn't seem to be enough.
I think I should maybe use L1 and L2 regularization, but I don't know how.
def resnet_model():
    model = VGGFace(model='resnet50')  # model: {resnet50, vgg16, senet50}
    xl = model.get_layer('avg_pool').output
    x = keras.layers.Flatten(name='flatten')(xl)
    x = keras.layers.Dense(4096, activation='relu')(x)
    x = keras.layers.Dropout(0.5)(x)
    x = keras.layers.Dense(4096, activation='relu')(x)
    x = keras.layers.Dropout(0.5)(x)
    x = keras.layers.Dense(11, activation='softmax', name='predictions')(x)
    model = keras.engine.Model(model.input, outputs=x)
    return model
model = resnet_model()
initial_learning_rate = 0.0003
epochs = 20
batch_size = 110
num_steps = train_x.shape[0] // batch_size
learning_rate_fn = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    [3*num_steps, 10*num_steps, 16*num_steps, 25*num_steps],
    [1e-4, 1e-5, 1e-6, 1e-7, 5e-7]
)
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate_fn)
model.compile(loss=custom_loss, optimizer=optimizer, metrics=['accuracy', one_off_accuracy])
model.fit(train_x, train_y, epochs=epochs, batch_size=batch_size, validation_data=(test_x, test_y))
This is an example of the results:

There are many regularization methods that can help you avoid overfitting your model:
Dropout:
Randomly disables neurons during training, which forces the other neurons to be trained as well.
L1/L2 penalties:
Penalize large weights by adding a term to the loss. This discourages the model from relying heavily on a few parameters when classifying an input.
Random Gaussian Noise at the inputs:
Adds random Gaussian noise to the inputs: x = x + r, where r is a small random value drawn from a zero-mean normal distribution. Because each input is slightly different in every epoch, it is harder for the model to memorize your dataset.
Label Smoothing:
Instead of saying that a target is 0 or 1, you can smooth those values (e.g. 0.1 and 0.9).
Early Stopping:
This is a common technique for avoiding training your model for too long. If you notice that the training loss keeps decreasing while the validation accuracy starts to drop (or the validation loss starts to rise), that is a good sign to stop training, because your model is beginning to overfit.
K-Fold Cross-Validation:
Trains and validates the model on several different splits of the data, so the model is not always evaluated on the same inputs and overfitting to one particular split is easier to detect.
Data Augmentations:
Rotating, shifting, zooming, flipping, padding, etc. an image gives the model more varied training examples, which makes it harder to overfit the existing dataset.
There are more techniques for avoiding overfitting beyond these.
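As a rough illustration of the L1/L2 item in the list above, the sketch below computes the penalty that is added on top of the data loss. The weight values and the coefficients are made up for the example. In Keras the same idea is usually attached per layer, for instance by passing `kernel_regularizer=keras.regularizers.l1_l2(l1=..., l2=...)` to a `Dense` layer; suitable coefficients for this particular model would have to be found by experiment.

```python
def l1_l2_penalty(weights, l1=0.0, l2=0.0):
    """Return l1 * sum(|w|) + l2 * sum(w^2) over a flat list of weights."""
    l1_term = l1 * sum(abs(w) for w in weights)
    l2_term = l2 * sum(w * w for w in weights)
    return l1_term + l2_term


def regularized_loss(data_loss, weights, l1=0.0, l2=0.0):
    """Total loss the optimizer minimizes: data loss plus weight penalty."""
    return data_loss + l1_l2_penalty(weights, l1, l2)


# Hypothetical weights and coefficients, for illustration only.
weights = [0.5, -2.0, 0.0, 1.5]
print("penalty:", l1_l2_penalty(weights, l1=0.01, l2=0.001))
print("total loss:", regularized_loss(0.30, weights, l1=0.01, l2=0.001))
```

Larger weights raise the total loss, so the optimizer is pushed toward smaller weights unless they clearly reduce the data loss.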

You can try incorporating image augmentation into your training, which increases both the effective "sample size" and the "diversity" of your data, as @Suraj S Jain mentioned.
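A minimal sketch of the augmentation idea, using a random horizontal flip on an image stored as a list of pixel rows. The tiny 2x3 "image" and the helper names are made up for the example. In a real Keras pipeline you would more likely use built-in preprocessing layers such as `tf.keras.layers.RandomFlip`, and which transformations preserve the age label is an assumption you would need to check for your data.

```python
import random


def horizontal_flip(image):
    """Mirror an image (a list of pixel rows) left to right."""
    return [row[::-1] for row in image]


def random_flip(image, rng, p=0.5):
    """Flip with probability p; otherwise return an unchanged copy."""
    if rng.random() < p:
        return horizontal_flip(image)
    return [row[:] for row in image]


rng = random.Random(0)  # seeded so repeated runs behave the same
image = [[1, 2, 3],
         [4, 5, 6]]

# Each epoch can see a differently augmented version of the same sample.
for epoch in range(3):
    print(epoch, random_flip(image, rng))
```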


Training loss plateaus but validation loss decreases

I am training a fully convolutional regressor, with mobilenet as its backbone. I have already overcome a massive overfitting problem by augmenting the data. However, the training loss seems to be stuck after a couple of epochs. On the other hand, validation loss reduces for a reasonable number of epochs and plateaus. Here are the learning curves.
Learning curves
And here is the architecture and compiling parameters.
backbone = keras.applications.ResNet50(
    weights="imagenet",
    include_top=False,  # assumed: needed so the backbone returns a feature map for the conv layers below
    input_shape=(shape[0], shape[1], 3),
)
backbone.trainable = True
inputs = layers.Input((shape[0], shape[1], 3))
x = keras.applications.resnet.preprocess_input(inputs)
x = backbone(x)
x = layers.Dropout(0.5)(x)
x = layers.SeparableConv2D(
    N, kernel_size=5, strides=1, activation="relu"
)(x)
outputs = layers.SeparableConv2D(
    N, kernel_size=3, strides=1
)(x)
model = keras.Model(inputs, outputs, name="mobilenet_FullyConv_rect")
model.compile(loss="mse", optimizer=keras.optimizers.Adam(1e-4))
history = model.fit(
    ...,  # training data argument not shown
    validation_data=val_gen,
    # remaining arguments not shown
)
Any idea why the training loss plateaus sooner than the validation loss would be appreciated, as would any suggestions on how to reach smaller loss values instead of plateauing so early.
I have tried more complex backbones such as ResNet50 and DenseNet169, because I suspected underfitting. However, this did not solve anything, and the validation loss fluctuated massively. I also tried augmenting my data even more (tripling the data instead of doubling it). This did not help much either, which makes me believe that this is actually underfitting, since feeding in more data did not lead to improvements. In short, I am torn between collecting more data and making my architecture more complex.

My model fit too slow, tringle of val_loss is 90

I have a task to write a neural network with 9 input neurons and 4 output neurons for a multiclass classification problem. I have tried different models, and for all of them:
Drop-out mechanism is used.
Batch normalization is used.
The resulting neural networks all overfit. Precision is below 80%, and I want at least 90%. The median loss is 0.8.
Please, can you suggest to me what model I should use?
TMS_coefficients.RData file
Part of my code:
(trainX, testX, trainY, testY) = train_test_split(dataset,
    values, test_size=0.25, random_state=42)
# neural network model
visible = layers.Input(shape=(9,))
hidden0 = layers.Dense(64, activation="tanh")(visible)
batch0 = layers.BatchNormalization()(hidden0)
drop0 = layers.Dropout(0.3)(batch0)
hidden1 = layers.Dense(32, activation="tanh")(drop0)
batch1 = layers.BatchNormalization()(hidden1)
drop1 = layers.Dropout(0.2)(batch1)
hidden2 = layers.Dense(128, activation="tanh")(drop1)
batch2 = layers.BatchNormalization()(hidden2)
drop2 = layers.Dropout(0.5)(batch2)
hidden3 = layers.Dense(64, activation="tanh")(drop2)
batch3 = layers.BatchNormalization()(hidden3)
output = layers.Dense(4, activation="softmax")(batch3)
model = tf.keras.Model(inputs=visible, outputs=output)
history = model.fit(trainX, trainY, validation_data=(testX, testY), epochs=5000, batch_size=256)
From the loss curve, I would say it is not overfitting at all. In fact, your model is underfitting: when you stopped training, the validation loss curve had not flattened out yet. That means your model still has the potential to do better if it is trained for longer.
A model overfits when the training loss keeps decreasing (or stays the same) while the validation loss steadily increases. That is clearly not the case here.
So, here is what you can do:
Try training longer.
Add more layers.
Try different activation functions, such as ReLU instead of tanh.
Use lower dropout rates (your model is probably struggling to learn with such high dropout values).
Make sure you shuffled your data before the train-test split (sklearn's train_test_split() does this by default), and also check that the test data is similar to the training data and that both go through the same preprocessing steps.
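The overfitting criterion described above can be turned into a rough check on the recorded loss curves. This is only a heuristic sketch: the function names, the window size, and the tolerance are assumptions, and the example curves are invented rather than taken from the question.

```python
def trend(values, window=3):
    """Average change per epoch over the last `window` steps (negative = decreasing)."""
    recent = values[-(window + 1):]
    return (recent[-1] - recent[0]) / (len(recent) - 1)


def diagnose(train_loss, val_loss, window=3, tol=1e-3):
    """Classify the end of training from per-epoch loss lists."""
    t = trend(train_loss, window)
    v = trend(val_loss, window)
    if v > tol and t <= tol:
        return "overfitting"      # validation rising while training is flat or falling
    if v < -tol:
        return "still improving"  # validation still falling: train longer
    return "plateau"


# Invented curves for illustration.
print(diagnose([1.0, 0.8, 0.6, 0.5, 0.4], [1.1, 0.9, 0.8, 0.85, 0.95]))
print(diagnose([1.0, 0.9, 0.8, 0.7, 0.6], [1.2, 1.1, 1.0, 0.9, 0.8]))
```

With Keras, the two lists would typically come from `history.history["loss"]` and `history.history["val_loss"]`.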

Tensorflow Polynomial Linear Regression curve fit

I have created this linear regression model using TensorFlow (Keras). However, I am not getting good results, and my model is trying to fit the points around a straight line. I believe fitting the points with a degree 'n' polynomial could give better results. I have searched for how to change my model to polynomial regression using TensorFlow Keras, but could not find a good resource. Any recommendations on how to improve the predictions?
I have a large dataset. I shuffled it first and then split it into 80% training and 20% testing. The dataset is also normalized.
1) Building model:
def build_model():
    model = keras.Sequential()
    model.add(keras.layers.Dense(units=300, input_dim=32))
    # sigmoid tanh softmax relu
    optimizer = tf.train.RMSPropOptimizer(0.001)  # any further arguments are not shown
    # optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
    return model
model = build_model()
2) Train the model:
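class PrintDot(keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs):
        if epoch % 100 == 0: print('')
        print('.', end='')
EPOCHS = 500
# Store training stats
# train_data is assumed here by analogy with test_data below; any further arguments are not shown
history = model.fit(train_data, train_labels, epochs=EPOCHS,
                    validation_split=0.2, verbose=1)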
3) Plot training loss and validation loss
4) Stop when the results no longer improve
5) Evaluate the result
[loss, mae] = model.evaluate(test_data, test_labels, verbose=0)
#Testing set Mean Abs Error: 1.9020842795676374
6) Predict:
test_predictions = model.predict(test_data).flatten()
7) Prediction error:
Polynomial regression is linear regression with additional input features that are polynomial functions of the original input features.
Let the original input features be (x1, x2, x3, ...).
Generate a set of polynomial features by applying transformations to the original features, for example (x1^2, x2^3, x1^3*x2, ...).
You can decide which functions to include based on constraints such as your intuition about their correlation with the target values, computational resources, and training time.
Append these new features to the original input feature vector. The transformed input feature vector now has a size of len(x1, x2, x3, ...) + len(x1^2, x2^3, x1^3*x2, ...).
This updated set of input features (x1, x2, x3, x1^2, x2^3, x1^3*x2, ...) is then fed into the ordinary linear regression model. The ANN's architecture may need to be tuned again to get the best trained model.
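The feature expansion described above can be sketched in plain Python as follows. This is an illustrative helper, not code from the question: the function name is made up, and it generates every monomial up to a given total degree rather than a hand-picked subset. scikit-learn's `PolynomialFeatures` offers a ready-made version of the same idea.

```python
from itertools import combinations_with_replacement
from math import prod


def polynomial_features(x, degree=2):
    """Expand one sample x into all monomials of total degree 1..degree."""
    features = []
    for d in range(1, degree + 1):
        # Each index tuple picks the factors of one monomial, e.g. (0, 1) -> x[0] * x[1].
        for idx in combinations_with_replacement(range(len(x)), d):
            features.append(prod(x[i] for i in idx))
    return features


sample = [2, 3]
# Order of the output: x1, x2, x1^2, x1*x2, x2^2
print(polynomial_features(sample, degree=2))

# With 32 inputs, degree 2 already yields 32 + 528 = 560 features.
print(len(polynomial_features([1.0] * 32, degree=2)))
```

The expanded vectors would then replace the raw inputs of the linear model, so the input dimension of the first layer has to match the new feature count.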
PS: I see that your network is huge while the number of inputs is only 32, which is not a common scale for an architecture. Even for this particular linear model, reducing the network to one or two hidden layers may help in training better models. (This suggestion assumes that this dataset is similar to other commonly seen regression datasets.)
I've actually created polynomial layers for TensorFlow 2.0, though these may not be exactly what you are looking for. If they are, you could use those layers directly or follow the procedure used there to create a more general layer.

Keras LSTM always underfits

I am trying to train an LSTM with Keras and Tensorflow backend but it seems to always underfit; the loss and validation loss curves have an initial drop and then flatten out very fast (see image). I have tried adding more layers, more neurons, no dropout, etc., but can't get it even anywhere near an overfit and I do have a good bit of data (almost 4 hours with 100 samples per second, and I have tried downsampling to 50/sec).
My problem is multidimensional time series prediction with continuous values.
Any ideas would be appreciated!
Here is my basic keras architecture:
data_dim = 30 #input dimensions => each timestep has 30 features
timesteps = 200
out_dim = 30 #output dimensions => each predicted output timestep
# has 30 dimensions
batch_size = 50
num_epochs = 300
learning_rate = 0.0005 #tried values between around 0.001 and 0.0003
#hidden layers size
h1 = 120
h2 = 340
h3 = 340
h4 = 120
model = Sequential()
model.add(LSTM(h1, return_sequences=True,input_shape=(timesteps, data_dim)))
model.add(LSTM(h2, return_sequences=True))
model.add(LSTM(h3, return_sequences=True))
model.add(LSTM(h4, return_sequences=True))
model.add(Dense(out_dim, activation='linear'))
rmsprop_otim = keras.optimizers.RMSprop(lr=learning_rate, rho=0.9, epsilon=1e-08, decay=decay)
model.compile(loss='mean_squared_error', optimizer=rmsprop_otim,metrics=['mse'])
#data preparation
[x_train, y_train] = readData()
x_train = x_train.reshape((int(num_samples/timesteps),timesteps,data_dim))
y_train = y_train.reshape((int(num_samples/timesteps),timesteps,num_classes))
history_callback = model.fit(x_train, y_train, validation_split=0.1,
    batch_size=batch_size, epochs=num_epochs, shuffle=False, callbacks=[checkpointer, losses])
Whether an MSE of 0.06 counts as underfitting depends a lot on the data distribution. MSE is a relative measure, so if the data is not normalized, 0.06 might even be an overfit. In that case, preprocessing might help. Also, check whether there is significant noise in the data.
Using 4 LSTM layers with large sizes means there are a lot of parameters to learn. Fewer layers might be enough.
Try a non-linear activation in the final layer.
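To make the point about MSE being relative more concrete, here is a small sketch with invented numbers. It shows that rescaling the data by a factor of 10 multiplies the MSE by 100, while the MSE divided by the variance of the targets stays the same. The helper names are made up for the example.

```python
from statistics import pvariance


def mse(y_true, y_pred):
    """Mean squared error between two equally long lists."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_mse(y_true, y_pred):
    """MSE divided by the target variance (1.0 = no better than predicting the mean)."""
    return mse(y_true, y_pred) / pvariance(y_true)


# Invented targets and predictions.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]

# The same data expressed in units that are 10 times smaller.
y_true_scaled = [10 * v for v in y_true]
y_pred_scaled = [10 * v for v in y_pred]

print("MSE:", mse(y_true, y_pred), "vs scaled:", mse(y_true_scaled, y_pred_scaled))
print("relative MSE:", relative_mse(y_true, y_pred),
      "vs scaled:", relative_mse(y_true_scaled, y_pred_scaled))
```

Comparing the loss against the variance of the targets is one way to judge whether a value like 0.06 is small or large for a given dataset.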
I suspect that your model only learns the weights of the Dense layer properly, but not those of the LSTM layers below it. As a quick check, what kind of performance do you get when you remove all LSTM layers and replace the Dense layer with a TimeDistributed(Dense...) layer? If your graphs look the same, training doesn't work, i.e. the error gradients with respect to the lower-layer weights may be too small. Another way of checking this is to inspect the gradients directly and/or to compare the final weights after training with the initial weights. If this is indeed the problem, you can try the following: 1) standardize your inputs, 2) use a smaller learning rate (decrease it logarithmically), 3) add skip-layer connections, 4) use Adam instead of RMSprop, and 5) train for more epochs.
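One way to do the weight comparison mentioned above is to measure how far each layer's weights moved relative to their initial size. The sketch below works on flat lists of numbers; the layer names and values are hypothetical. With Keras, the snapshots could be taken with `layer.get_weights()` before and after training and flattened first.

```python
import math


def relative_change(initial, final):
    """Return ||final - initial|| / ||initial|| for two flat weight lists."""
    diff = math.sqrt(sum((f - i) ** 2 for i, f in zip(initial, final)))
    norm = math.sqrt(sum(i * i for i in initial))
    if norm == 0.0:
        return 0.0 if diff == 0.0 else float("inf")
    return diff / norm


# Hypothetical snapshots: layer name -> (weights before, weights after).
snapshots = {
    "lstm_1": ([0.20, -0.10, 0.05], [0.20, -0.10, 0.05]),
    "dense": ([0.30, 0.40, -0.20], [0.90, -0.10, 0.35]),
}

for name, (before, after) in snapshots.items():
    print(f"{name}: relative change = {relative_change(before, after):.3f}")
```

A value near zero for the LSTM layers next to a large value for the Dense layer would be consistent with the suspicion that only the top layer is being trained.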

The CNN model does not learn when adding one/two more convolutional layers

I am trying to implement the model in the picture in TensorFlow, except that instead of 6 output neurons I have 1000. I am doing this to obtain pretrained weights.
I implemented the full model with only one (14,14,128) layer at first, just for testing. Now that the program has matured, I added the other two layers (or one of them). With them, the model does not learn anything: the loss is constant (apart from a small amount of noise) and the accuracy on the training images stays at the level of random guessing. Before adding the layers, I could very quickly (5-10 min) reach an accuracy of 70-80 percent on a 1000-image subset of the training dataset. As said, this is no longer the case with the additional layers.
Here is the code for the additional layer where s1 and s2 is the stride of the conv:
w2 = weight_variable([3,3,64,128])
b2 = bias_variable([128])
h2 = tf.nn.relu(conv2d_s2(h1_pool,w2)+b2)
h2_pool = max_pool_2x2(h2)
#Starts additional layer
w3 = weight_variable([3,3,128,128])
b3 = bias_variable([128])
h3 = tf.nn.relu(conv2d_s1(h2_pool,w3)+b3)
#Ends additional layer
w5 = weight_variable([3,3,128,256])
b5 = bias_variable([256])
h5 = tf.nn.relu(conv2d_s1(h3,w5)+b5)
h5_pool = max_pool_2x2(h5)
This extra layer makes the model worthless. I have tried different hyperparameters (learning rate, batch size, epochs) without success. Where does the problem lie?
Another question: does anyone know a small (and/or better) network of this size that I can implement and test? My goal is to detect grasp positions on different objects (from images of the objects).
If it helps, I use one GTX 980 with a very very good xeon.
The repository can be found in
The problem was a diverging loss. It was solved by lowering the learning rate.
In orange are the accuracy and loss (TensorBoard and terminal) from when the program showed no learning at all. I was fooled by the loss shown in the terminal. After @hars suggested checking the accuracy and loss logs, I discovered that the loss in TensorBoard diverges in the first steps. After changing the learning rate from 0.01 to 0.001 the divergence disappeared and, as you can see in cyan, the model was learning (it overfitted a small subset of ImageNet in 1 minute).
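The effect of the learning rate can be reproduced on a toy problem. The sketch below runs plain gradient descent on a one-parameter quadratic loss. The curvature of 250 is an arbitrary choice made so that the two learning rates mentioned above fall on either side of the stability limit (lr < 2 / curvature); it says nothing about the actual network.

```python
def run_gd(lr, curvature=250.0, w0=1.0, steps=20):
    """Gradient descent on loss(w) = 0.5 * curvature * w^2; returns the loss per step."""
    w = w0
    losses = []
    for _ in range(steps):
        losses.append(0.5 * curvature * w * w)
        grad = curvature * w
        w -= lr * grad  # each step multiplies w by (1 - lr * curvature)
    return losses


for lr in (0.01, 0.001):
    losses = run_gd(lr)
    print(f"lr={lr}: first loss={losses[0]:.3f}, last loss={losses[-1]:.3e}")
```

With the larger rate every step overshoots the minimum by more than it corrects, so the loss grows; with the smaller rate the loss shrinks steadily.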
You have a ReLU at the end of the model, which might be clipping all gradients, followed by softmax cross-entropy with logits in the training part. As a result, the model might get stuck in a poor local minimum.
Try removing tf.nn.relu in the last line of your inference function and see if it trains well.
Here are the relevant parts of your code.
Last few lines of the model:
# fc1 layer
W_fc3 = weight_variable([512, 1000])
b_fc3 = bias_variable([1000])
output = tf.nn.relu(tf.matmul(h_fc2, W_fc3) + b_fc3)
#print("output: {}".format(output.get_shape()))
return output
training part of the code:
logits = inference_redmon.inference(images)
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
logits=logits, labels=labels_one))
tf.summary.scalar('loss', loss)
correct_pred = tf.equal( tf.argmax(logits,1), tf.argmax(labels_one,1))
accuracy = tf.reduce_mean( tf.cast( correct_pred, tf.float32))
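To illustrate why a ReLU in front of softmax cross-entropy can stall training, the sketch below uses plain Python with invented pre-activation values. When every pre-activation is negative, the ReLU outputs all zeros, the softmax becomes uniform, and the loss sits at ln(number of classes) regardless of the label. The ReLU also passes no gradient back for those units, so the layer cannot recover by itself.

```python
import math


def relu(values):
    return [max(0.0, v) for v in values]


def softmax_cross_entropy(logits, label_index):
    """Cross-entropy of softmax(logits) against a one-hot label, computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label_index]


# Invented pre-activations of the last layer for one sample with 4 classes.
pre_activation = [-2.0, -0.5, -1.0, -3.0]
label = 1

loss_raw = softmax_cross_entropy(pre_activation, label)
loss_clipped = softmax_cross_entropy(relu(pre_activation), label)

print("loss without ReLU:", loss_raw)
print("loss with ReLU:   ", loss_clipped, "(ln 4 =", math.log(4), ")")
```

For 1000 classes the corresponding stuck value would be ln(1000), roughly 6.91, so a loss hovering around that number is one possible symptom of this situation.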