MLP output of first layer is zero after one epoch - tensorflow

I've been running into an issue lately trying to train a simple MLP.
I'm basically trying to get a network to map the XYZ position and RPY orientation of the end-effector of a robot arm (6-dimensional input) to the angle of every joint of the robot arm to reach that position (6-dimensional output), so this is a regression problem.
I've generated a dataset using the angles to compute the current position, and generated datasets with 5k, 500k and 5M sets of values.
My issue is the MLP I'm using doesn't learn anything at all. Using Tensorboard (I'm using Keras), I've realized that the output of my very first layer is always zero (see image 1), no matter what I try.
Basically, my input is a shape (6,) vector and the output is also a shape (6,) vector.
Here is what I've tried so far, without success:
I've tried MLPs with 2 layers of size 12, 24; 2 layers of size 48, 48; 4 layers of size 12, 24, 24, 48.
Adam, SGD, RMSprop optimizers
Learning rates ranging from 0.15 to 0.001, with and without decay
Both Mean Squared Error (MSE) and Mean Absolute Error (MAE) as the loss function
Normalizing the input data, and not normalizing it (the first 3 values are between -3 and +3, the last 3 are between -pi and pi)
Batch sizes of 1, 10, 32
Tested the MLP on all 3 datasets of 5k values, 500k values and 5M values.
Tested with the number of epochs ranging from 10 to 1000
Tested multiple initializers for the bias and kernel.
Tested both the Sequential model and the Keras functional API (to make sure the issue wasn't how I called the model)
All 3 of sigmoid, relu and tanh activation functions for the hidden layers (the last layer has a linear activation because it's a regression)
Additionally, I've tried the very same MLP architecture on the basic Boston housing price regression dataset by Keras, and the net was definitely learning something, which leads me to believe that there may be some kind of issue with my data. However, I'm at a complete loss as to what it may be as the system in its current state does not learn anything at all, the loss function just stalls starting on the 1st epoch.
Any help or lead would be appreciated, and I will gladly provide code or data if needed!
Thank you
Here's a link to 5k samples of the data I'm using. Columns B-G are the output (angles used to generate the position/orientation) and columns H-M are the input (XYZ position and RPY orientation).
Also, here's a snippet of the code I'm using:
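df = pd.read_csv('kinova_jaco_data_5k.csv', names=['state0', 'state1', 'state2', 'state3', 'state4', 'state5',
                                                   'pose0', 'pose1', 'pose2', 'pose3', 'pose4', 'pose5'])
# Stack the six joint angles (targets) and six pose values (inputs) into (n_samples, 6) arrays
states = np.asarray(
    [df.state0.to_numpy(), df.state1.to_numpy(), df.state2.to_numpy(), df.state3.to_numpy(), df.state4.to_numpy(),
     df.state5.to_numpy()]).transpose()
poses = np.asarray(
    [df.pose0.to_numpy(), df.pose1.to_numpy(), df.pose2.to_numpy(), df.pose3.to_numpy(), df.pose4.to_numpy(),
     df.pose5.to_numpy()]).transpose()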
x_train_temp, x_test, y_train_temp, y_test = train_test_split(poses, states, test_size=0.2)
x_train, x_val, y_train, y_val = train_test_split(x_train_temp, y_train_temp, test_size=0.2)
mean = x_train.mean(axis=0)
x_train -= mean
std = x_train.std(axis=0)
x_train /= std
x_test -= mean
x_test /= std
x_val -= mean
x_val /= std
n_epochs = 100
n_units = [48, 48]
n_hidden_layers = len(n_units)
activation = 'relu'
inputs = Input(shape=(6,), dtype='float32', name='input')
x = Dense(units=n_units[0], activation=activation, name='dense1')(inputs)
for i in range(1, n_hidden_layers):
    x = Dense(units=n_units[i], activation=activation, name='dense'+str(i+1))(x)
out = Dense(units=6, activation='linear', name='output_layer')(x)
model = Model(inputs=inputs, outputs=out)
optimizer = SGD(lr=0.1, momentum=0.4)
model.compile(optimizer=optimizer, loss='mse', metrics=['mse', 'mae'])
history = model.fit(x_train, y_train, epochs=n_epochs,
                    validation_data=(x_test, y_test))
Edit 2
I've tested the architecture with a random dataset where the input was a (6,) vector where input[i] is a random number and the output was a (6,) vector with output[i] = input[i]² and the network didn't learn anything. I've also tested a random dataset where the input was a random number and the output was a linear function of the input, and the loss converged to 0 pretty quickly. In short, it seems the simple architecture is unable to map a non-linear function.

the output of my very first layer is always zero.
This typically means that the network does not "see" any pattern in the input at all, which causes it to always predict the mean of the target over the entire training set, regardless of input. Your output is in the range of -𝜋 to 𝜋 probably with an expected value of 0, so it checks out.
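One way to check whether the network has collapsed to predicting the mean is to compare the stalled loss against the loss of a constant predictor. The sketch below is only an illustration: it uses synthetic targets drawn uniformly from -π to π as a stand-in for the real joint angles (an assumption, not the asker's data), and the variable names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the joint-angle targets: 6 angles per sample in [-pi, pi].
y = rng.uniform(-np.pi, np.pi, size=(5000, 6))

# A model that ignores its input can do no better than predicting the target mean.
mean_prediction = y.mean(axis=0)

baseline_mse = np.mean((y - mean_prediction) ** 2)
baseline_mae = np.mean(np.abs(y - mean_prediction))

print("baseline MSE:", baseline_mse)  # for uniform targets this is near pi**2 / 3
print("baseline MAE:", baseline_mae)
```

If the training loss plateaus at roughly this baseline value, the model is most likely outputting the mean regardless of the input.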
My guess is that the model is too small to represent the data efficiently. I would suggest that you increase the number of parameters in the model by a factor of 10 or 100 and see if it starts seeing something. Limiting the number of parameters has a regularizing effect on the network, and strong regularization usually leads to the aforementioned collapse to the mean.
I'm by no means a robotics expert, but I guess that there are a lot of situations where a small nudge in the target pose (the network input) causes a large change in the joint angles (the network output). Let's say I'm trying to scratch my back with my left hand: the farther my hand goes to the left, the harder the task becomes, so at some point I might want to switch hands, which is a discontinuous configuration change. A bad analogy, sure, but I hope it demonstrates my hunch that there are certain places in the configuration space where small target changes cause large configuration changes.
Such large changes will cause a very large, very noisy gradient around those points. I'm not sure how well the network will work around these noisy gradients, but I would suggest as an experiment that you try to limit the training dataset to a set of outputs that are connected smoothly to one another in the configuration space of the arm, if that makes sense. Going further, you should remove any points from the dataset that are close to such configuration boundaries. To make up for that at inference time, you might instead want to sample several close-by points and choose the most common prediction as the final result. Hopefully some of those points will land in a smooth configuration area.
Also, adding batch normalization before each dense layer will help smooth the gradient and provide for more reliable training.
As for the rest of your hyperparameters:
A batch size of 32 is good; a very small batch size will make the gradient too noisy.
The loss function is not critical; both MSE and MAE should work.
The activation functions aren't critical; ReLU is a good default choice.
The default initializers are good enough.
Normalizing is important for Dense layers, so keep it.
Train for as many epochs as you need as long as both the training and validation loss are dropping. If the validation loss hasn't dropped for 5-10 epochs you might as well stop early.
Adam is a good default choice. Start with a small learning rate and increase the learning rate at the beginning of training only if the training loss is dropping consistently over several epochs.
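The early-stopping rule of thumb above can be sketched as a small framework-independent helper. The function name `should_stop` and its `patience` and `min_delta` parameters are hypothetical, chosen only for this illustration; Keras users would more likely reach for the built-in `EarlyStopping` callback.

```python
def should_stop(val_losses, patience=5, min_delta=0.0):
    """Return True if none of the last `patience` epochs improved on the
    best validation loss seen before them by more than `min_delta`."""
    if len(val_losses) <= patience:
        return False  # not enough history yet
    best_before = min(val_losses[:-patience])
    recent_best = min(val_losses[-patience:])
    return recent_best > best_before - min_delta


# Validation loss improves for three epochs, then stalls for five.
history = [1.0, 0.9, 0.8, 0.81, 0.82, 0.83, 0.84, 0.85]
print(should_stop(history, patience=5))
```

Calling this once per epoch with the running list of validation losses is enough to decide when to break out of a manual training loop.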
Further reading: 37 Reasons why your Neural Network is not working

I ended up replacing the first dense layer with a Conv1D layer and the network now seems to be learning decently. It's overfitting to my data, but that's territory I'm okay with.
I'm closing the thread for now, I'll spend some time playing with the architecture.


Neural Network Input scaling

I trained a simple fully connected network on CIFAR-10 dataset:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(3*32*32, 300, bias=False)
        self.fc2 = nn.Linear(300, 10, bias=False)

    def forward(self, x):
        # Flatten each image; the batch size is hard-coded to 250
        x = x.reshape(250, -1)
        self.x2 = F.relu(self.fc1(x))
        x = self.fc2(self.x2)
        return x


def train():
    # torchvision datasets return PIL images; ToTensor converts them to tensors in the range [0, 1].
    transform = transforms.Compose([transforms.ToTensor()])
    trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
    trainloader = torch.utils.data.DataLoader(trainset, batch_size=250, shuffle=True, num_workers=4)
    testset = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)
    testloader = torch.utils.data.DataLoader(testset, batch_size=250, shuffle=False, num_workers=4)
    net = Net()
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(net.parameters(), lr=0.02, momentum=0.9, weight_decay=0.0001)
    for epoch in range(20):
        correct = 0
        total = 0
        for data in trainloader:
            inputs, labels = data
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            # Backpropagate and update the weights
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
        acc = 100. * correct / total
This network gets to ~50% test accuracy with the parameters specified, after 20 epochs.
Note that I didn't do any whitening of the inputs (no per channel mean subtraction)
Next I scaled up the model inputs by 255, by replacing outputs = net(inputs) with outputs = net(inputs*255). After this change, the network no longer converges. I looked at the gradients and they seem to grow explosively after just a few iterations, leading to all model outputs being zero. I'd like to understand why this is happening.
Also, I tried scaling down the learning rate by 255. This helps, but the network only gets to ~43% accuracy. Again, I don't understand why this helps, and more importantly why the accuracy is still degraded compared to the original settings.
EDIT: forgot to mention that I don't use biases in this network.
EDIT2: I can recover the original accuracy if I scale down the initial weights in both layers by 255 (in addition to scaling down the learning rate). I also tried to scale down the initial weights only in the first layer, but the network had trouble learning (even when I did scale down the learning rate in both layers). Then I tried scaling down the learning rate only in the first layer, which also didn't help. Finally I tried reducing the learning rate in both layers even more (by 255*255) and this suddenly worked. This does not make sense to me: scaling down the initial weights by the same factor the inputs have been scaled up by should have completely eliminated any difference from the original network, since the input to the second layer is identical. At that point the learning rate should need to be scaled down in the first layer only, but in practice both layers need a significantly lower learning rate...
Scaling up the inputs will lead to exploding gradients because of a few observations:
The learning rate is common to all the weights in a given update step.
Hence, the same scaling factor (i.e. the learning rate) is applied to a given weight's cost derivative regardless of its magnitude, so large and small weights get updated by the same scale.
When the loss landscape is highly erratic, this leads to exploding gradients (like a snowball effect: one overshot update, say along the axis of one particular weight, causes another in the opposite direction in the next update, which overshoots again, and so on).
Raw pixel values range from 0 to 255, so dividing the data by 255 (which ToTensor already does) ensures all inputs are between 0 and 1 and gives smoother convergence, as all the gradients will be on a uniform scale with respect to the learning rate. Here you scaled the learning rate instead, which mitigates some of the problems mentioned above but is not as effective as scaling the data itself. A smaller learning rate also makes convergence slower, which might be why the network only reaches 43% after 20 epochs; it may simply need more epochs.
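To make the interaction between input scale, weight scale and learning rate concrete, here is a minimal NumPy sketch. It assumes a single bias-free linear layer with a squared-error loss, which is a simplification of the network in the question (that one has a ReLU, a second layer and cross-entropy loss), and all names in it are illustrative. It shows that multiplying the inputs by c while dividing the weights by c leaves the output unchanged but makes the weight gradient c times larger, so the learning rate has to shrink by c squared for the update to match the original one. This is at least consistent with the 255*255 factor reported in EDIT2.

```python
import numpy as np


def grad_W(W, x, t):
    """Gradient of 0.5 * ||W @ x - t||^2 with respect to W."""
    return np.outer(W @ x - t, x)


rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))   # weights of the original layer
x = rng.uniform(size=5)       # input in [0, 1], like a normalized pixel vector
t = rng.normal(size=3)        # regression target
c, lr = 255.0, 0.02

# Original problem vs. inputs scaled up by c with weights scaled down by c.
g = grad_W(W, x, t)
g_scaled = grad_W(W / c, c * x, t)

# One SGD step in each setting; the scaled setting uses lr / c**2.
W_new = W - lr * g
W_scaled_new = W / c - (lr / c ** 2) * g_scaled

print(np.allclose(g_scaled, c * g))          # gradient is c times larger
print(np.allclose(c * W_scaled_new, W_new))  # updates match after rescaling
```

With only lr / c in the scaled setting, each step would be c times too large relative to the size of the weights, which is one way the training can become unstable.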
CIFAR-10 is a significant step up from something like the MNIST dataset, and fully connected neural networks lack the representational power needed to classify these images accurately. CNNs are the way to go for any image classification task beyond MNIST. Unfortunately, around 50% accuracy is roughly what you can expect from a small fully connected network like this one.
Maybe decrease the learning rate by 1/255 ... just a guess

Text classification issue

I'm a newbie in ML and I'm trying to classify text into two categories. My dataset is built with Tokenizer from medical texts; it's unbalanced, and there are 572 records for training and 471 for testing.
It's really hard for me to build a model with diverse prediction output; almost all values are the same. I've tried using models from examples like this and tweaking the parameters myself, but the output never makes sense.
Here are tokenized and prepared data
Here is script: Gist
Sample model that I used
sequential_model = keras.Sequential([
    layers.Dense(15, activation='tanh', input_dim=vocab_size),
    layers.Dense(8, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])
train_history = sequential_model.fit(train_data, train_labels,
    validation_data=(test_data, test_labels),
    class_weight={1: 1, 0: 0.2})
Unfortunately I can't share datasets.
I've also tried using keras.utils.to_categorical with the class labels, but it didn't help.
Your loss curves make sense: we see the network overfit to the training set, while the validation loss follows the usual bowl-shaped curve.
To make your network perform better, you can always deepen it (more layers), widen it (more units per hidden layer) and/or add more nonlinear activation functions so that your layers can map to a wider range of values.
Also, I believe the reason you originally got so many repeated values is the size of your network. Apparently, each of the data points has roughly 20,000 features (a pretty large feature space); your network is too small, and the space of output values it can map to is consequently smaller. I did some testing with larger hidden layers (and bumped up the number of layers) and was able to see that the prediction values did vary: [0.519], [0.41], [0.37]...
It is also understandable that your network's performance varies so much, because the number of features is roughly 35 times the number of training records (about 20,000 features versus 572 records); usually you want that ratio to be much smaller. Keep in mind that training for many epochs (say, more than 10) on such small training and test sets just to see the loss improve is not great practice: you can seriously overfit, and it is probably a sign that your network needs to be wider or deeper.
All of these factors, such as the number of layers, the number of hidden units and even the number of epochs, can be treated as hyperparameters. In other words, hold out some percentage of your training data as a validation split, go through each of these factors one by one, and optimize for the highest validation accuracy. To be fair, your training set is not very large, but I believe you should hold out some 10-20% of it as a validation set to tune these hyperparameters, given that you have such a large number of features per data point. At the end of this process, you can evaluate once on the test set to determine your true test accuracy. This is how I would optimize to get the best performance out of this network. Hope this helps.
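The hold-out step described above could look like the following sketch. It uses plain NumPy and placeholder arrays with the question's 572 training records; the helper name `holdout_split`, the 20-column feature matrix and the 20% fraction are assumptions for illustration only.

```python
import numpy as np


def holdout_split(data, labels, val_fraction=0.2, seed=0):
    """Shuffle the samples and split off `val_fraction` of them for validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(data))
    n_val = int(round(len(data) * val_fraction))
    val_idx, train_idx = order[:n_val], order[n_val:]
    return data[train_idx], labels[train_idx], data[val_idx], labels[val_idx]


# Placeholder data: 572 records; the labels are just record ids here
# so that the split can be checked for overlap.
data = np.zeros((572, 20))
labels = np.arange(572)

x_tr, y_tr, x_val, y_val = holdout_split(data, labels, val_fraction=0.2)
print(len(x_tr), len(x_val))
```

Because the dataset in the question is unbalanced, a stratified split (keeping the class ratio the same in both parts) may be worth considering instead of a purely random one.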
More about training, test, val split

Low accuracy of DNN created using tf.keras on dataset having small feature set

total train data record: 460000
total cross-validation data record: 89000
number of output class: 392
tensorflow 1.8.0 CPU installation
Each data record has 26 features, where 25 are numeric and one is categorical which is one hot encoded into 19 additional features. Initially, not all feature value was present for each data record. I have used avg to fill missing float type features and most frequent value for missing int type feature. Output can be any of 392 classes labeled as 0 to 391.
Finally, all features are passed through a StandardScaler()
Here is my model:
import logging
import numpy as np
import tensorflow as tf

output_class = 392
X_train, X_test, y_train, y_test = get_data()
# y_train and y_test contain ints from 0-391
# Make y_train and y_test categorical
y_train = tf.keras.utils.to_categorical(y_train, output_class)
y_test = tf.keras.utils.to_categorical(y_test, output_class)
# Convert to float type
y_train = y_train.astype(np.float32)
y_test = y_test.astype(np.float32)
# tf.enable_eager_execution() # turned off to use rmsprop optimizer
model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(400, activation=tf.nn.relu, input_shape=(44,)))
model.add(tf.keras.layers.Dense(40000, activation=tf.nn.relu))
model.add(tf.keras.layers.Dense(392, activation=tf.nn.softmax))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
logging.getLogger().setLevel(logging.INFO)
model.fit(X_train, y_train, epochs=3)
loss, acc = model.evaluate(X_test, y_test)
print('Accuracy', acc)
But this model gives only 28% accuracy on both training and test data. What should I change here to get good accuracy on both training and test data? Should I go wider and deeper? Or should I consider taking more features?
Note: there were total 400 unique features in the dataset. But most of the features only appeared randomly in 5 to 10 data record. And some features have no relevance in other data records. I picked 26 features based on domain knowledge and frequency in data records.
Any suggestion is appreciated. Thanks.
EDIT: I forgot to add this in the original post. @Neb suggested a narrower, deeper network; I actually tried this. My first model was a [44,400,400,392] network. It gave me around 30% accuracy in training and testing.
Your model is too wide. You have 400 nodes in the first hidden layer and 40,000 in the second, for a total of 44*400 + 400*40,000 + 40,000*392 = 31,697,600 weights (not counting biases). However, you only input 44 features!
Because of this, your net is capable of detecting even the smallest, most imperceptible variations in the inputs, and it ends up treating them as valuable information instead of noise. I'm quite sure that if you leave your network training for a long time (here I only see 3 epochs), it will end up overfitting your training set.
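As a quick sanity check on the model size, the parameter count of a stack of Dense layers can be computed directly from the layer widths. The helper below is a small sketch (its name is made up for the example); it compares the [44, 400, 40000, 392] stack from the question with a narrower [44, 128, 512, 392] stack. Note that the last layer of the question's model connects 40,000 units to 392 outputs, which is where a large share of the weights comes from.

```python
def dense_param_count(layer_sizes, use_bias=True):
    """Count trainable parameters in a stack of fully connected layers.

    layer_sizes lists the input width followed by the width of every layer.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out      # weight matrix
        if use_bias:
            total += n_out         # bias vector
    return total


wide = dense_param_count([44, 400, 40000, 392])
narrow = dense_param_count([44, 128, 512, 392])
print(wide, narrow)
```

The narrower stack has roughly a hundred times fewer parameters, which makes it far cheaper to train.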
You have some solutions:
Reduce the number of nodes per layer. You may also experiment with adding 1 or 2 new layers. A possible structure might be [44, 128, 512, 392].
Implement regularization. You have multiple ways to do this:
restrict the range in which the network parameters live
implement Dropout
implement Batch normalization (which is known to have a small regularization effect)
Use the Adam optimizer instead of RMSprop.
If your features are somewhat correlated, you may try a CNN instead of a Fully connected network.
Then, to improve generalization you can:
explore the dataset looking for outliers and remove them. An outlier is a sample which can confuse the network or does not convey any additional information.
"randomly" initialize your parameters, e.g. using Xavier initialization
Finally, I would say: do you really need 392 classes? Could you merge some of them?

Many to Many LSTM in TensorFlow : Training error not decreasing

I am trying to train an LSTM to behave like a controller. Essentially this is a many-to-many problem. I have 7 input features, each being a sequence of 40 values. My output has two features, each also a sequence of 40 values.
I have 2 layers. The first layer has four LSTM cells, and the second has two LSTM cells. The code is given below.
The code runs and produces output as expected, but I am unable to reduce the training error (mean squared error). The error just stops improving after the first 1000 epochs.
I tried using different batch sizes, but I get a high error even if the batch size is one. I tried the same network with a simple sine function, and it works properly, i.e. the error decreases. Is this because my sequence length is too large, causing a vanishing gradient problem? What can I do to improve the training error?
# Specify input and output features
Xfeatures = 7  # Number of input features
Yfeatures = 2  # Number of output features
num_steps = 40
# reset everything to rerun in jupyter
tf.reset_default_graph()
# Placeholders for the inputs and targets in a given iteration.
u_opt = tf.placeholder(tf.float32, [train_batch_size, num_steps, Xfeatures])
u_NN = tf.placeholder(tf.float32, [train_batch_size, num_steps, Yfeatures])
with tf.name_scope('Normalization'):
    # L2 normalization for input data
    Xnorm = tf.nn.l2_normalize(u_opt, 0, epsilon=1e-12, name='Normalize')
lstm1 = tf.contrib.rnn.BasicLSTMCell(lstm1_size)
lstm2 = tf.contrib.rnn.BasicLSTMCell(lstm2_size)
stacked_lstm = tf.contrib.rnn.MultiRNNCell([lstm1, lstm2])
LSTM_outputs, states = tf.nn.dynamic_rnn(stacked_lstm, Xnorm, dtype=tf.float32)
mean_square_error = tf.losses.mean_squared_error(u_NN, LSTM_outputs)
train_step = tf.train.AdamOptimizer(learning_rate).minimize(mean_square_error)
# Initialization and training session
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    for i in range(training_epochs):
        sess.run([train_step], feed_dict={u_opt: InputX1, u_NN: InputY1})
        if i % display_epoch == 0:
            print("Training loss is:", sess.run([mean_square_error], feed_dict={u_opt: InputX1, u_NN: InputY1}), "at iteration:", i)
What do you mean by "First layer has four LSTM cells, and second has two LSTM cells. The code is given below"? You probably mean the state sizes of the cells.
Your code is not complete, but I can try to give you some advice.
If your training error is not going down, one possibility is that your net is not well dimensioned. Your lstm1_size and lstm2_size are probably not large enough to capture the characteristics of your data.
LSTMs help you accumulate the past of a given sequence in a state vector. Usually, the state vector is not used directly as the predictor; instead it is projected to the output space using a standard feedforward layer. You could probably keep a single recurrent layer (a single LSTM layer) and then project the outputs of that layer using a feedforward layer (i.e. g(W*LSTM_outputs+b), where g is a non-linear activation).
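The projection g(W*LSTM_outputs+b) can be illustrated with a small NumPy sketch that applies the same dense layer at every time step. The hidden size of 16, the random stand-in for the LSTM outputs and the choice of tanh for g are all assumptions made for this example; in the actual TensorFlow graph the weights W and b would be trainable variables.

```python
import numpy as np


def project_outputs(lstm_outputs, W, b, g=np.tanh):
    """Apply g(h @ W + b) to the hidden vector h of every time step.

    lstm_outputs: [batch, steps, hidden], W: [hidden, out], b: [out].
    """
    return g(lstm_outputs @ W + b)


rng = np.random.default_rng(0)
batch, num_steps, hidden, Yfeatures = 3, 40, 16, 2

lstm_outputs = rng.normal(size=(batch, num_steps, hidden))  # stand-in for the LSTM layer output
W = rng.normal(scale=0.1, size=(hidden, Yfeatures))
b = np.zeros(Yfeatures)

y_pred = project_outputs(lstm_outputs, W, b)
print(y_pred.shape)
```

This decouples the size of the recurrent state from the number of output features, so the LSTM can be made wider without changing the output shape. Keep in mind that tanh bounds the outputs to (-1, 1), so for unbounded regression targets a linear g may be the better choice.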

Using squared difference of two images as loss function in tensorflow

I'm trying to use the SSD between two images as loss function for my network.
# h_fc2 is my output layer, y_ is my label image.
ssd = tf.reduce_sum(tf.square(y_ - h_fc2))
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(ssd)
Problem is, that the weights then diverge and I get the error
ReluGrad input is not finite. : Tensor had Inf values
Why's that? I did try some other stuff like normalizing the ssd by the image size (did not work) or cropping the output values to 1 (does not crash anymore, but I still need to evaluate this):
ssd_min_1 = tf.reduce_sum(tf.square(y_ - tf.minimum(h_fc2, 1)))
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(ssd_min_1)
Are my observations to be expected?
@mdaoust's suggestions proved to be correct. The main point was normalizing by batch size. This can be done independently of the batch size by using this code:
squared_diff_image = tf.square(label_image - output_img)
# Sum over all dimensions except the first (the batch-dimension).
ssd_images = tf.reduce_sum(squared_diff_image, [1, 2, 3])
# Take mean ssd over batch.
error_images = tf.reduce_mean(ssd_images)
With this change, only a slight decrease of the learning rate (to 0.0001) was necessary.
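To see why the normalization matters, the following NumPy sketch compares the summed SSD with the per-image SSD averaged over the batch. The image size (8x8x1) and the constant label and output values are made up for the illustration. The summed loss, and therefore its gradient, grows with the batch size, whereas the batch mean does not.

```python
import numpy as np


def ssd_total(labels, outputs):
    """Sum of squared differences over the whole batch."""
    return np.sum((labels - outputs) ** 2)


def ssd_mean_per_image(labels, outputs):
    """Sum over all dimensions except the batch dimension, then average over the batch."""
    per_image = np.sum((labels - outputs) ** 2, axis=(1, 2, 3))
    return per_image.mean()


for batch_size in (4, 16):
    labels = np.ones((batch_size, 8, 8, 1))
    outputs = np.zeros((batch_size, 8, 8, 1))
    print(batch_size, ssd_total(labels, outputs), ssd_mean_per_image(labels, outputs))
```

Dividing by the number of pixels as well would give an average error per pixel, which also makes the loss scale independent of the image size.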
There are a lot of ways you can end up with non-finite results.
But optimizers, especially simple ones like gradient descent, can diverge if the learning rate is 'too high'.
Have you tried simply dividing your learning rate by 10/100/1000? Or normalizing by pixels*batch_size to get the average error per pixel?
Or one of the more advanced optimizers? For example tf.train.AdamOptimizer() with default options.