How is the MSE loss calculated for multiple neurons in the output layer - tensorflow

I have a feedforward regression network (in Keras with TensorFlow backend) with a single hidden layer (30 neurons) and an output layer with 2 neurons (for the imaginary and real parts of a complex signal) ... My question is: how exactly is the MSE loss calculated,
given that I am getting only one number in the "history object" for each epoch?
Eventually I would like to extract a separate loss value per output neuron for each epoch. Is that possible in Keras?

The loss is calculated for every batch pass, and those batch losses are then averaged into an epoch loss, which is the single number you are given.
If you want to calculate the loss for each output neuron separately, I think you will have to split your output layer into two single-neuron outputs. You can then assign a loss function to each output and you will have access to the loss values of both neurons. Note that you will also have to split your ground truth into two values, since you now have two outputs instead of one.
The code could look like this:
inputs = x = tf.keras.layers.Input(input_shape)
x = tf.keras.layers.Dense(30)(x)
y1 = tf.keras.layers.Dense(1)(x)   # e.g. real part
y2 = tf.keras.layers.Dense(1)(x)   # e.g. imaginary part
model = tf.keras.Model(inputs=inputs, outputs=[y1, y2])
loss = [tf.keras.losses.MeanSquaredError(), tf.keras.losses.MeanSquaredError()]
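To get a separate loss value per output neuron for each epoch, you can then read them from the History object after training. A minimal sketch, assuming the model and loss list defined above; the optimizer, epoch count and the training arrays X_train, y_real, y_imag (the ground truth split into one array per output) are placeholders for your own setup:
model.compile(optimizer='adam', loss=loss)
# X_train, y_real, y_imag stand in for your own training data
history = model.fit(X_train, [y_real, y_imag], epochs=10)
# 'loss' is the combined loss; each output also gets its own entry,
# keyed by its output layer name (e.g. 'dense_1_loss', 'dense_2_loss')
print(history.history.keys())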

Related

How can I properly train a model to predict a moving average using LSTM in keras?

I'm learning how to train an RNN model in Keras and I was expecting that training a model to predict the moving average of the last N steps would be quite easy.
I have a time series with thousands of steps and I'm able to create a model and train it with batches of data.
If I train it with the following model though, the test set predictions differ a lot from real values. (batch = 30, moving average window = 10)
inputs = tf.keras.Input(shape=(batch_length, num_features))
x = tf.keras.layers.LSTM(10, return_sequences=False)(inputs)
outputs = tf.keras.layers.Dense(num_labels)(x)
model = tf.keras.Model(inputs=inputs, outputs=outputs, name="test_model")
To get good predictions, I need to add another layer wrapped in TimeDistributed, getting 2D predictions instead of 1D ones (one prediction per time step):
inputs = tf.keras.Input(shape=(batch_length, num_features))
x = tf.keras.layers.LSTM(10, return_sequences=True)(inputs)
x = tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(num_labels))(x)
outputs = tf.keras.layers.Dense(num_labels)(x)
model = tf.keras.Model(inputs=inputs, outputs=outputs, name="test_model")
If your goal is to give the last 10 timesteps as input and predict their moving average, I suggest trying a regressor model with densely connected layers rather than an RNN (a linear activation with regularization might work well enough); see the sketch below.
That option would be cheaper to train and run than an LSTM.
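A rough sketch of that suggestion, assuming a window of 10 timesteps; the layer sizes, regularization strength and the synthetic data below are illustrative, not the asker's setup:
import numpy as np
import tensorflow as tf

window = 10  # assumed moving-average window, also the input length

# Synthetic series; the target is the mean of the previous `window` values
series = np.cumsum(np.random.randn(5000)).astype('float32')
X = np.stack([series[i:i + window] for i in range(len(series) - window)])
y = X.mean(axis=1, keepdims=True)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(window,)),
    tf.keras.layers.Dense(32, activation='linear',
                          kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer='adam', loss='mse')
model.fit(X, y, epochs=5, batch_size=32, verbose=0)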

Number of nodes in output layer greater than number of classes in a neural network

While training a neural network on the fashion MNIST dataset, I decided to have a greater number of nodes in my output layer than the number of classes in the dataset.
The dataset has 10 classes, while I trained my network to have 15 nodes in the output layer. I also used a softmax.
Now surprisingly, this gave me an accuracy of 97% which is quite good.
This leads me to the question, what do those extra 5 nodes even mean, and what do they do here?
Why is my softmax able to work properly when the label range(0-9) isn't equal to the number of nodes(15)?
And finally, in general, what does it mean to have more nodes in your output layer than the number of classes, in a classification task?
I understand the effects of having fewer nodes than the number of classes, and also that the rule of thumb is to use number of nodes = number of classes. Yet I've never seen anyone use a greater number of nodes, and I'd like to understand why/why not.
I'm attaching some code so that the results can be reproduced. This was done using TensorFlow 2.3.
import tensorflow as tf
print(tf.__version__)

mnist = tf.keras.datasets.mnist
(training_images, training_labels), (test_images, test_labels) = mnist.load_data()

training_images = training_images / 255.0
test_images = test_images / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation=tf.nn.relu),
    tf.keras.layers.Dense(15, activation=tf.nn.softmax)
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(training_images, training_labels, epochs=5)
model.evaluate(test_images, test_labels)
The only reason you are able to use such a configuration is because you have specified your loss function as sparse_categorical_crossentropy.
Let's understand the effect of extra output nodes in forward propagation.
Consider a neural network with 2 layers.
1st layer - 6 neurons (Hidden layer)
2nd layer - 4 neurons (output layer)
You have a dataset X whose shape is (100, 12), i.e. 12 features and 100 rows.
You have labels y whose shape is (100,), containing two unique values, 0 and 1.
Therefore this is essentially a binary classification problem, but we will use 4 neurons in our output layer.
Consider each neuron as a logistic regression unit. Each of your neurons will therefore have 12 weights (w1, w2, ..., w12).
Why? Because you have 12 features.
Each neuron outputs a single value a; its computation has two steps:
z = w1*x1 + w2*x2 + ... + w12*x12 + w0   # w0 is the bias
a = activation(z)
Therefore, your 1st layer will output 6 values for each row in the dataset.
So now you have a feature matrix of shape (100, 6).
This is passed to the 2nd layer and the same process repeats.
So, in essence, you are able to complete the forward propagation step even when you have more neurons than actual classes.
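A tiny NumPy sketch of that forward pass, just to show that the shapes work out with 4 output neurons (random weights, tanh and softmax chosen purely for illustration):
import numpy as np

X = np.random.randn(100, 12)                    # 100 rows, 12 features
W1, b1 = np.random.randn(12, 6), np.zeros(6)    # hidden layer: 6 neurons
W2, b2 = np.random.randn(6, 4), np.zeros(4)     # output layer: 4 neurons

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

A1 = np.tanh(X @ W1 + b1)        # (100, 6) hidden activations
A2 = softmax(A1 @ W2 + b2)       # (100, 4) "class" probabilities
print(A2.shape)                  # forward propagation works even though 4 > 2 classes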
Now let's see backpropagation.
For backpropagation to exist you must be able to calculate the loss_value.
We will take a small example:
y_true has two labels as in our problem and y_pred has 4 probability values since we have 4 units in our final layer.
y_true = [0, 1]
y_pred = [[0.03, 0.90, 0.02, 0.05], [0.15, 0.02, 0.8, 0.03]]
# Using 'auto'/'sum_over_batch_size' reduction type.
scce = tf.keras.losses.SparseCategoricalCrossentropy()
scce(y_true, y_pred).numpy() # 3.7092905
How is it calculated:
-( log(0.03) + log(0.02) ) / 2 ≈ 3.7092905
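You can verify that number by hand (this is just a check of the formula above):
import numpy as np

y_true = [0, 1]
y_pred = [[0.03, 0.90, 0.02, 0.05], [0.15, 0.02, 0.8, 0.03]]

# Take the predicted probability of the true class for each sample,
# apply -log, and average over the batch
losses = [-np.log(p[t]) for t, p in zip(y_true, y_pred)]
print(np.mean(losses))   # ~3.7092905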
So essentially we can compute the loss, which means we can also compute its gradients.
Therefore there is no problem in using backpropagation either.
Hence our model can train perfectly well and achieve around 90% accuracy.
So, the final question: what are these extra neurons representing, i.e. neuron 2 and neuron 3?
Answer: they represent the probability of the example belonging to class 2 and class 3 respectively. But since the labels contain no examples of class 2 or class 3, they will have zero contribution in calculating the loss value.
Note: if you encode your y_label as one-hot vectors and use categorical_crossentropy as your loss, you will encounter an error, as sketched below.
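For instance, a hedged sketch with the model above (the exact error message depends on the TF version, but it fails because the one-hot targets have length 10 while the model outputs 15 values):
import tensorflow as tf

one_hot_labels = tf.one_hot(training_labels, depth=10)   # shape (N, 10)
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
# Raises a shape-incompatibility error: targets are (N, 10), predictions are (N, 15)
model.fit(training_images, one_hot_labels, epochs=1)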

Tensorflow: my rnn always output same value, weights of rnn are not trained

I used TensorFlow to implement a simple RNN model to learn possible trends of time series data and predict future values. However, the model always produces the same value after training. In effect, the best model it finds is:
y = b.
The RNN structure is:
InputLayer -> BasicRNNCell -> Dense -> OutputLayer
RNN code:
def RNN(n_timesteps, n_input, n_output, n_units):
    tf.reset_default_graph()
    X = tf.placeholder(dtype=tf.float32, shape=[None, n_timesteps, n_input])
    cells = [tf.contrib.rnn.BasicRNNCell(num_units=n_units)]
    stacked_rnn = tf.contrib.rnn.MultiRNNCell(cells)
    stacked_output, states = tf.nn.dynamic_rnn(stacked_rnn, X, dtype=tf.float32)
    stacked_output = tf.layers.dense(stacked_output, n_output)
    return X, stacked_output
During training, n_timesteps=1, n_input=1, n_output=1, n_units=2, learning_rate=0.0000001, and the loss is calculated as the mean squared error.
The input is a sequence of data from consecutive days; the output is the data for the day after the input days.
(Maybe these are not good settings, but no matter how I change them, the results are almost the same, so I just use these values to illustrate the problem below.)
And I found out this is because the weights and bias of the BasicRNNCell are not trained; they stay the same from the beginning. Only the weights and bias of the Dense layer keep changing. So during training, I get predictions like these:
In the beginning:
loss: 1433683500.0
rnn/multi_rnn_cell/cell_0/cell0/kernel:0 [KEEP UNCHANGED]
rnn/multi_rnn_cell/cell_0/cell0/bias:0 [KEEP UNCHANGED]
dense/kernel:0 [CHANGING]
dense/bias:0 [CHANGING]
After a while:
loss: 175372340.0
rnn/multi_rnn_cell/cell_0/cell0/kernel:0 [KEEP UNCHANGED]
rnn/multi_rnn_cell/cell_0/cell0/bias:0 [KEEP UNCHANGED]
dense/kernel:0 [CHANGING]
dense/bias:0 [CHANGING]
The orange line indicates the true data and the blue line the results of my code. Through training, the blue line keeps going up until the model reaches a stable loss.
So I doubted my implementation and generated a group of data with y = 10x + 5 for testing. This time, my model learns the correct result.
(Plots of the predictions at the beginning and at the end of training are omitted here.)
I have tried:
add more layers of both BasicRNNCell and Dense
increase rnn cell hidden num(n_units) to 128
decrease learning_rate to 1e-10
increase timesteps to 60
None of them worked.
So, my questions are:
Is it because my model is too simple? I don't think the trend of my data is that complicated to learn; at least something like y = ax + b should produce a smaller loss than y = b.
What may lead to these results?
Or how should I go on debugging?
And now I doubt whether BasicRNNCell is fully implemented and whether users are expected to implement some of its functions themselves? I have no previous experience with TensorFlow.
It seems your net is just not fit for that kind of data or, put another way, your data is badly scaled. When I add the 4 lines below after split_data, I get some learning behavior similar to the y = a*x + b case:
data = read_data(work_dir, input_file)
plot_data(data)
input_data, output_data, n_batches = split_data(data, n_timesteps, n_input, n_output)
# scale input and output data
input_data = input_data-input_data[0]
input_data = input_data/np.max(input_data)*1000
output_data = output_data-output_data[0]
output_data = output_data/np.max(output_data)*1000

Confused usage of dropout in mini-batch gradient descent

My question is at the end.
An example CNN is trained with mini-batch GD and uses dropout in the last fully-connected layer (line 60) as
fc1 = tf.layers.dropout(fc1, rate=dropout, training=is_training)
At first I thought tf.layers.dropout or tf.nn.dropout randomly sets neurons to zero column-wise (the same units for every sample). But I recently found that is not the case. The piece of code below prints what the dropout does. I used fc0 as a 4-sample x 10-feature matrix, and fc as the dropped-out version.
import tensorflow as tf
import numpy as np
fc0 = tf.random_normal([4, 10])
fc = tf.nn.dropout(fc0, 0.5)
sess = tf.Session()
sess.run(tf.global_variables_initializer())
a, b = sess.run([fc0, fc])
np.savetxt("oo.txt", np.vstack((a, b)), fmt="%.2f", delimiter=",")
And in the output oo.txt (original matrix: line 1-4, dropped out matrix: line 5-8):
0.10,1.69,0.36,-0.53,0.89,0.71,-0.84,0.24,-0.72,-0.44
0.88,0.32,0.58,-0.18,1.57,0.04,0.58,-0.56,-0.66,0.59
-1.65,-1.68,-0.26,-0.09,-1.35,-0.21,1.78,-1.69,-0.47,1.26
-1.52,0.52,-0.99,0.35,0.90,1.17,-0.92,-0.68,-0.27,0.68
0.20,0.00,0.71,-0.00,0.00,0.00,-0.00,0.47,-0.00,-0.87
0.00,0.00,0.00,-0.00,3.15,0.07,1.16,-0.00,-1.32,0.00
-0.00,-3.36,-0.00,-0.17,-0.00,-0.42,3.57,-3.37,-0.00,2.53
-0.00,1.05,-1.99,0.00,1.80,0.00,-0.00,-0.00,-0.55,1.35
My understanding of "proper" dropout is that it knocks out the same p% of units for every sample in a mini-batch or batch gradient descent step, and back-propagation then updates the weights and biases of that "thinned network". However, in the example's implementation, the neurons of each sample in a batch are dropped independently, as shown in lines 5 to 8 of oo.txt, so each sample sees a different "thinned network".
As a comparison, in the stochastic gradient descent case, samples are fed into the network one by one, and in each iteration the weights of the "thinned network" introduced by tf.layers.dropout are updated.
My question is: in mini-batch or batch training, shouldn't the implementation knock out the same neurons for all samples in one batch? Maybe by applying one mask to all input samples of the batch at each iteration?
Something like:
# ones: a 1xN all 1s tensor
# mask: a 1xN 0-1 tensor, multiply fc1 by mask with broadcasting along the axis of samples
mask = tf.layers.dropout(ones, rate=dropout, training=is_training)
fc1 = tf.multiply(fc1, mask)
Now I'm thinking the dropout strategy in the example may amount to a weighted way of updating the weights of a given neuron: if a neuron is kept in only 1 out of 10 samples of a mini-batch, its weights will be updated by alpha * 1/10 * (y_k_hat - y_k) * x_k, compared with alpha * 1/10 * sum[(y_k_hat - y_k) * x_k] for the weights of another neuron that is kept in all 10 samples?
Dropout is commonly used to prevent overfitting. In this case the overfitting would be a huge weight applied to one of the neurons. By randomly making it 0 from time to time, you force the network to use more neurons in determining the outcome. For this to work well, you should drop different neurons for each example, so that the gradient you compute is more similar to the one you would get without dropout.
If you were to drop the same neurons for each example in the batch, my guess is that you would get a less stable gradient (which might not matter for your application).
In addition, dropout up-scales the remaining values to keep the average activation at about the same level. Without this, the network would learn wrong biases or would over-saturate when you turn dropout off.
If you still want the same neurons to be dropped across the batch, then apply dropout to an all-ones tensor of shape (1, num_neurons) and multiply it with the activations, as in the sketch below.
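A minimal TF 1.x-style sketch of that idea, matching the tf.nn.dropout usage from the question (names and sizes are illustrative):
import tensorflow as tf   # TF 1.x, as in the question

num_neurons = 10
fc1 = tf.random_normal([4, num_neurons])      # (batch, features)

# Dropout on a (1, num_neurons) all-ones tensor yields a single mask whose kept
# entries are already scaled by 1/keep_prob; broadcasting applies it to every sample
ones = tf.ones([1, num_neurons])
mask = tf.nn.dropout(ones, keep_prob=0.5)
fc1_shared_dropout = fc1 * mask               # same neurons dropped for all samples

with tf.Session() as sess:
    print(sess.run(fc1_shared_dropout))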
When using dropout, you are effectively trying to estimate the average performance of the network for a randomly chosen dropout mask, using Monte-Carlo sampling (by differentiation under the integral sign, the average gradient is equal to the gradient of the average). By fixing a dropout mask for each mini-batch, you are just introducing correlation between successive gradient estimates, which increases the variance and leads to slower training.
Imagine using a different dropout-mask for each image in the mini-batch, but forming the mini-batch from k copies of the same image; it's obvious that this would be a complete waste of effort!

Many to Many LSTM in TensorFlow : Training error not decreasing

I am trying to train an LSTM to behave like a controller. Essentially this is a many-to-many problem. I have 7 input features, with each feature being a sequence of 40 values. My output has two features, also being sequences of 40 values.
I have 2 layers. First layer has four LSTM cells, and second has two LSTM cells. The code is given below.
The code runs and produces output as expected, but I am unable to reduce the training error (mean squared error). The error just stops improving after the first 1000 epochs.
I tried using different batch sizes, but I am getting a high error even if the batch size is one. I tried the same network with a simple sine function, and it works properly, i.e. the error decreases. Is this because my sequence length is too large, causing the vanishing gradient problem? What can I do to improve the training error?
#Specify input and output features
Xfeatures = 7   # Number of input features
Yfeatures = 2   # Number of output features
num_steps = 40

# reset everything to rerun in jupyter
tf.reset_default_graph()

# Placeholders for the inputs and targets in a given iteration
u_opt = tf.placeholder(tf.float32, [train_batch_size, num_steps, Xfeatures])
u_NN = tf.placeholder(tf.float32, [train_batch_size, num_steps, Yfeatures])

with tf.name_scope('Normalization'):
    # L2 normalization for input data
    Xnorm = tf.nn.l2_normalize(u_opt, 0, epsilon=1e-12, name='Normalize')

lstm1 = tf.contrib.rnn.BasicLSTMCell(lstm1_size)
lstm2 = tf.contrib.rnn.BasicLSTMCell(lstm2_size)
stacked_lstm = tf.contrib.rnn.MultiRNNCell([lstm1, lstm2])
print(lstm1.output_size)
print(stacked_lstm.output_size)

LSTM_outputs, states = tf.nn.dynamic_rnn(stacked_lstm, Xnorm, dtype=tf.float32)

# Loss
mean_square_error = tf.losses.mean_squared_error(u_NN, LSTM_outputs)
train_step = tf.train.AdamOptimizer(learning_rate).minimize(mean_square_error)

# Initialization and training session
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    #print(sess.run([LSTM_outputs], feed_dict={u_opt: InputX1}))
    print(sess.run([mean_square_error], feed_dict={u_opt: InputX1, u_NN: InputY1}))
    for i in range(training_epochs):
        sess.run([train_step], feed_dict={u_opt: InputX1, u_NN: InputY1})
        if i % display_epoch == 0:
            print("Training loss is:", sess.run([mean_square_error], feed_dict={u_opt: InputX1, u_NN: InputY1}), "at iteration:", i)
    print(sess.run([mean_square_error], feed_dict={u_opt: InputX1, u_NN: InputY1}))
    print(sess.run([LSTM_outputs], feed_dict={u_opt: InputX1}))
What do you mean by "First layer has four LSTM cells, and second has two LSTM cells. The code is given below"? You probably mean the sizes of the cells' states.
Your code is not complete, but I can try to give you some advice.
If your training error is not going down, one possibility is that your net is not well dimensioned. Probably your lstm1_size and lstm2_size are not large enough to capture the characteristics of your data.
LSTMs help you accumulate the past of a given sequence in a state vector. Usually, the state vector is not used as the predictor itself; it is projected to the output space using a standard feedforward layer. You could probably keep a single layer of recursion (a single LSTM layer) and then project the outputs of that layer using a feedforward layer, i.e. g(W*LSTM_outputs + b), where g is a non-linear activation; a sketch is given below.
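A hedged sketch of that suggestion in the same TF 1.x style as the question; the hidden size of 64 and the tanh projection are assumptions, and it reuses Xnorm, u_NN, Yfeatures and learning_rate from the question's code:
# A single recurrent layer, then a per-timestep feedforward projection g(W*LSTM_outputs + b)
lstm = tf.contrib.rnn.BasicLSTMCell(64)                     # assumed hidden size
LSTM_outputs, states = tf.nn.dynamic_rnn(lstm, Xnorm, dtype=tf.float32)

# tf.layers.dense acts on the last axis, so this projects every time step
# from the LSTM state size down to the Yfeatures outputs
outputs = tf.layers.dense(LSTM_outputs, Yfeatures, activation=tf.nn.tanh)

mean_square_error = tf.losses.mean_squared_error(u_NN, outputs)
train_step = tf.train.AdamOptimizer(learning_rate).minimize(mean_square_error)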