I am trying to use train an LSTM to behave like a controller. Essential this is a many to many problem. I have 7 input features and with each feature being a sequence of 40 values. My output has two features, also being a sequence of 40 values.
I have 2 layers. First layer has four LSTM cells, and second has two LSTM cells. The code is given below.
The code runs and produces output as expected but I am unable to reduced the training error (Mean square error). The error just stops improving after the first 1000 epochs.
I tried using different batch sizes. But I am getting high error even if it the batch size is one. I tried the same network with a simple sine function, and it is working properly i.e. the error is decreasing. Is this because my sequence length is too large, due to which the vanishing gradient problem is occurring. What can I do to improve training error?
#Specify input and ouput features
Xfeatures = 7 #Number of input features
Yfeatures = 2 #Number of input features
num_steps = 40
# reset everything to rerun in jupyter
tf.reset_default_graph()
# Placeholder for the inputs in a given iteration.
u = tf.placeholder(tf.float32, [train_batch_size,num_steps,Xfeatures])
u_NN = tf.placeholder(tf.float32, [train_batch_size,num_steps,Yfeatures])
with tf.name_scope('Normalization'):
#L2 normalization for input data
Xnorm = tf.nn.l2_normalize(u_opt, 0, epsilon=1e-12, name='Normalize')
lstm1= tf.contrib.rnn.BasicLSTMCell(lstm1_size)
lstm2 = tf.contrib.rnn.BasicLSTMCell(lstm2_size)
stacked_lstm = tf.contrib.rnn.MultiRNNCell([lstm1, lstm2])
print(lstm1.output_size)
print(stacked_lstm.output_size)
LSTM_outputs, states = tf.nn.dynamic_rnn(stacked_lstm, Xnorm, dtype=tf.float32)
#Loss
mean_square_error = tf.losses.mean_squared_error(u_NN,LSTM_outputs)
train_step = tf.train.AdamOptimizer(learning_rate).minimize(mean_square_error)
#Initialization and training session
init = tf.global_variables_initializer()
with tf.Session() as sess:
sess.run(init)
#print(sess.run([LSTM_outputs],feed_dict={u_opt:InputX1}))
print(sess.run([mean_square_error],feed_dict={u_opt:InputX1,u_NN:InputY1}))
for i in range(training_epochs):
sess.run([train_step],feed_dict={u_opt:InputX1,u_NN:InputY1})
if i%display_epoch ==0:
print("Training loss is:",sess.run([mean_square_error],feed_dict={u_opt:InputX1,u_NN:InputY1}),"at itertion:",i)
print(sess.run([mean_square_error],feed_dict={u_opt:InputX1,u_NN:InputY1}))
print(sess.run([LSTM_outputs],feed_dict={u_opt:InputX1}))
What do you mean with: "First layer has four LSTM cells, and second has two LSTM cells. The code is given below"? Probably you intend the states of the cells.
Your code is not complete but I can try give you some advices.
If your training error is not going down, a possibility is that your net is not well dimensioned. Probably your lstm1_size and lstm2_size are not enough large to capture the characteristics of your data.
LSTMs help you in accumulating the past of a given sequences in a state vector. Usually, the state vector is not used itself as the predictor but it is projected to the output space using a standard feedforward layer. Probably you can just keep a single layer of recursion (a single LSTM layer) and than project the outputs of the layer using a feedforward layer (i.e. g(W*LSTM_outputs+b), where g is a non-linear activation).
Related
While training a neural network, on the fashion mnist dataset, I decided to have a greater number of nodes in my output layer than the number of classes in the dataset.
The dataset has 10 classes, while I trained my network to have 15 nodes in the output layer. I also used a softmax.
Now surprisingly, this gave me an accuracy of 97% which is quite good.
This leads me to the question, what do those extra 5 nodes even mean, and what do they do here?
Why is my softmax able to work properly when the label range(0-9) isn't equal to the number of nodes(15)?
And finally, in general, what does it mean to have more nodes in your output layer than the number of classes, in a classification task?
I understand the effects of having lesser nodes than the number of classes, and also that the rule of thumb is to use number of nodes = number of classes. Yet, I've never seen someone use a greater number of nodes, and I'd like to understand why/why not.
I'm attaching some code so that the results can be reproduced. This was done using Tensorflow 2.3
import tensorflow as tf
print(tf.__version__)
mnist = tf.keras.datasets.mnist
(training_images, training_labels) , (test_images, test_labels) = mnist.load_data()
training_images = training_images/255.0
test_images = test_images/255.0
model = tf.keras.models.Sequential([tf.keras.layers.Flatten(),
tf.keras.layers.Dense(256, activation=tf.nn.relu),
tf.keras.layers.Dense(15, activation=tf.nn.softmax)])
model.compile(optimizer = 'adam',
loss = 'sparse_categorical_crossentropy',
metrics = ['accuracy'])
model.fit(training_images, training_labels, epochs=5)
model.evaluate(test_images, test_labels)
The only reason you are able to use such a configuration is because you have specified your loss function as sparse_categorical_crossentropy.
let's understand the effects of greater output nodes in forward propagation.
Consider a neural network with 2 layers.
1st layer - 6 neurons (Hidden layer)
2nd layer - 4 neurons (output layer)
You have dataset X whose shape is(100*12) ie. 12 features and 100 rows.
you have labels y whose shape is (100,) containing two unique values 0 and 1.
Therefore essentially this is a binary classification problem but we will use 4 neurons in our output layer.
Consider each neuron as a logistic regression unit. Therefore each of your neurons will 12 weights (w1, w2,.....,w12)
Why? - Because you have 12 features.
Each neuron will output a single term given by a. I will give the computation of a in two steps.
z = w1x1 + w2x2 + ........ + w12*x12 + w0 # w0 is bias
a = activation(z)
Therefore, your 1st layer will output 6 values for each row in our dataset.
So now you have a feature matrix of 100 * 6.
This is passed to the 2nd layer and the same process repeats.
So in essence you are able to complete the forward propagation step even when you have more neurons than the actual classes.
Now let's see backpropagation.
For backpropagation to exist you must be able to calculate the loss_value.
we will take a small example:
y_true has two labels as in our problem and y_pred has 4 probability values since we have 4 units in our final layer.
y_true = [0, 1]
y_pred = [[0.03, 0.90, 0.02, 0.05], [0.15, 0.02, 0.8, 0.03]]
# Using 'auto'/'sum_over_batch_size' reduction type.
scce = tf.keras.losses.SparseCategoricalCrossentropy()
scce(y_true, y_pred).numpy() # 3.7092905
How is it calculated:
( log(0.03) + log(0.02) ) / 2
So essentially we can compute the loss so we can also compute its gradients.
Therefore no problem in using backpropagation too.
Therefore our model can very well train and achieve 90 % accuracy.
So the final question, what are these extra neurons representing. ie( neuron 2 and neuron 3).
Ans - They are representing the probability of the example being of class 2 and class 3 respectively. But since the labels contain no values of class 2 and class 3 they will have zero contribution in calculating the loss value.
Note- If you encode your y_label in one-hot-encoding and use categorical_crossentropy as your loss you will encounter an error.
I've been running into an issue lately trying to train a simple MLP.
I'm basically trying to get a network to map the XYZ position and RPY orientation of the end-effector of a robot arm (6-dimensional input) to the angle of every joint of the robot arm to reach that position (6-dimensional output), so this is a regression problem.
I've generated a dataset using the angles to compute the current position, and generated datasets with 5k, 500k and 500M sets of values.
My issue is the MLP I'm using doesn't learn anything at all. Using Tensorboard (I'm using Keras), I've realized that the output of my very first layer is always zero (see image 1), no matter what I try.
Basically, my input is a shape (6,) vector and the output is also a shape (6,) vector.
Here is what I've tried so far, without success:
I've tried MLPs with 2 layers of size 12, 24; 2 layers of size 48, 48; 4 layers of size 12, 24, 24, 48.
Adam, SGD, RMSprop optimizers
Learning rates ranging from 0.15 to 0.001, with and without decay
Both Mean Squared Error (MSE) and Mean Absolute Error (MAE) as the loss function
Normalizing the input data, and not normalizing it (the first 3 values are between -3 and +3, the last 3 are between -pi and pi)
Batch sizes of 1, 10, 32
Tested the MLP of all 3 datasets of 5k values, 500k values and 5M values.
Tested with number of epoches ranging from 10 to 1000
Tested multiple initializers for the bias and kernel.
Tested both the Sequential model and the Keras functional API (to make sure the issue wasn't how I called the model)
All 3 of sigmoid, relu and tanh activation functions for the hidden layers (the last layer is a linear activation because its a regression)
Additionally, I've tried the very same MLP architecture on the basic Boston housing price regression dataset by Keras, and the net was definitely learning something, which leads me to believe that there may be some kind of issue with my data. However, I'm at a complete loss as to what it may be as the system in its current state does not learn anything at all, the loss function just stalls starting on the 1st epoch.
Any help or lead would be appreciated, and I will gladly provide code or data if needed!
Thank you
EDIT:
Here's a link to 5k samples of the data I'm using. Columns B-G are the output (angles used to generate the position/orientation) and columns H-M are the input (XYZ position and RPY orientation). https://drive.google.com/file/d/18tQJBQg95ISpxF9T3v156JAWRBJYzeiG/view
Also, here's a snippet of the code I'm using:
df = pd.read_csv('kinova_jaco_data_5k.csv', names = ['state0',
'state1',
'state2',
'state3',
'state4',
'state5',
'pose0',
'pose1',
'pose2',
'pose3',
'pose4',
'pose5'])
states = np.asarray(
[df.state0.to_numpy(), df.state1.to_numpy(), df.state2.to_numpy(), df.state3.to_numpy(), df.state4.to_numpy(),
df.state5.to_numpy()]).transpose()
poses = np.asarray(
[df.pose0.to_numpy(), df.pose1.to_numpy(), df.pose2.to_numpy(), df.pose3.to_numpy(), df.pose4.to_numpy(),
df.pose5.to_numpy()]).transpose()
x_train_temp, x_test, y_train_temp, y_test = train_test_split(poses, states, test_size=0.2)
x_train, x_val, y_train, y_val = train_test_split(x_train_temp, y_train_temp, test_size=0.2)
mean = x_train.mean(axis=0)
x_train -= mean
std = x_train.std(axis=0)
x_train /= std
x_test -= mean
x_test /= std
x_val -= mean
x_val /= std
n_epochs = 100
n_hidden_layers=2
n_units=[48, 48]
inputs = Input(shape=(6,), dtype= 'float32', name = 'input')
x = Dense(units=n_units[0], activation=relu, name='dense1')(inputs)
for i in range(1, n_hidden_layers):
x = Dense(units=n_units[i], activation=activation, name='dense'+str(i+1))(x)
out = Dense(units=6, activation='linear', name='output_layer')(x)
model = Model(inputs=inputs, outputs=out)
optimizer = SGD(lr=0.1, momentum=0.4)
model.compile(optimizer=optimizer, loss='mse', metrics=['mse', 'mae'])
history = model.fit(x_train,
y_train,
epochs=n_epochs,
verbose=1,
validation_data=(x_test, y_test),
batch_size=32)
Edit 2
I've tested the architecture with a random dataset where the input was a (6,) vector where input[i] is a random number and the output was a (6,) vector with output[i] = input[i]² and the network didn't learn anything. I've also tested a random dataset where the input was a random number and the output was a linear function of the input, and the loss converged to 0 pretty quickly. In short, it seems the simple architecture is unable to map a non-linear function.
the output of my very first layer is always zero.
This typically means that the network does not "see" any pattern in the input at all, which causes it to always predict the mean of the target over the entire training set, regardless of input. Your output is in the range of -𝜋 to 𝜋 probably with an expected value of 0, so it checks out.
My guess is that the model is too small to represent the data efficiently. I would suggest that you increase the number of parameters in the model by a factor of 10 or 100 and see if it starts seeing something. Limiting the number of parameters has a regularizing effect on the network, and strong regularization usually leads the the aforementioned derping to the mean.
I'm by no means a robotics expert, but I guess that there are a lot of situations where a small nudge in the output parameters causes a large change of the input. Let's say I'm trying to scratch my back with my left hand - the farther my hand goes to the left, the harder the task becomes, so at some point I might want to switch hands, which is a discontinuous configuration change. A bad analogy, sure, but I hope it demonstrates my hunch that there are certain places in the configuration space where small target changes cause large configuration changes.
Such large changes will cause a very large, very noisy gradient around those points. I'm not sure how well the network will work around these noisy gradients, but I would suggest as an experiment that you try to limit the training dataset to a set of outputs that are connected smoothly to one another in the configuration space of the arm, if that makes sense. Going further, you should remove any points from the dataset that are close to such configuration boundaries. To make up for that at inference time, you might instead want to sample several close-by points and choose the most common prediction as the final result. Hopefully some of those points will land in a smooth configuration area.
Also, adding batch normalization before each dense layer will help smooth the gradient and provide for more reliable training.
As for the rest of your hyperparameters:
A batch size of 32 is good, a very small batch size will make the gradient too noisy
The loss function is not critical, both MSE and MAE should work
The activation functions aren't critical, ReLU is a good default choice.
The default initializers a good enough.
Normalizing is important for Dense layers, so keep it
Train for as many epochs as you need as long as both the training and validation loss are dropping. If the validation loss hasn't dropped for 5-10 epochs you might as well stop early.
Adam is a good default choice. Start with a small learning rate and increase the learning rate at the beginning of training only if the training loss is dropping consistently over several epochs.
Further reading: 37 Reasons why your Neural Network is not working
I ended up replacing the first dense layer with a Conv1D layer and the network now seems to be learning decently. It's overfitting to my data, but that's territory I'm okay with.
I'm closing the thread for now, I'll spend some time playing with the architecture.
I'm trying to compute the gradient of the output layer with respect to the input layer. My neural network is relatively small (input layer composed of 9 activation units and the output layer of 1) and the training went fine as the test provided a very good accuracy. I made the NN model using Keras.
In order to solve my problem, I need to compute the gradient of the output with respect to the input. This is, I need to obtain the Jacobian which as dimension [1x9]. The gradients function in tensorflow should provide me with everything I need, but when I run the code below I obtain a different solution every time.
output_v = model.output
input_v = model.input
gradients = tf.gradients(output_v, input_v)
sess = tf.Session()
sess.run(tf.initialize_all_variables())
print(sess.run(model.input,feed_dict={model.input:x_test_N[0:1,:]}))
evaluated_gradients = sess.run(gradients,feed_dict{model.input:x_test_N[0:1,:]})
print(evaluated_gradients)
sess.close()
The first print command shows this value every time I run it (just to make sure that the input values are not modified):
[[-1.4306372 -0.1272892 0.7145787 1.338818 -1.2957293 -0.5402862-0.7771702 -0.5787912 -0.9157122]]
But the second print shows different ones:
[[ 0.00175761, -0.0490326 , -0.05413761, 0.09952173, 0.06112418, -0.04772799, 0.06557006, -0.02473242, 0.05542536]]
[[-0.00416433, 0.08235116, -0.00930298, 0.04440641, 0.03752216, 0.06378302, 0.03508484, -0.01903783, -0.0538374 ]]
Using finite differences, evaluated_gradients[0,0] = 0.03565103, which isn't close to any of the first values previously printed.
Thanks for your time!
Alberto
Solved by creating a specific session just before training my model:
sess = tf.Session()
sess.run(tf.global_variables_initializer())
K.set_session(sess)
history = model.fit(x_train_N, y_train_N, epochs=n_epochs,
validation_split=split, verbose=1, batch_size=n_batch_size,
shuffle='true', callbacks=[early_stop, tensorboard])
And evaluating the gradient after training, while tf.session is still open:
evaluated_gradients = sess.run(K.gradients(model.output, model.input), feed_dict={model.input: x_test_N})
Presumably your network is set up to initialize weights to random values. When you run sess.run(tf.initialize_all_variables()), you are initializing your variables to new random values. Therefore you get different values for output_v in every run, and hence different gradients. If you want to use a model you trained before, you should replace the initialization with initialize_all_variables() with a restore command. I am not familiar with how this is done in Keras since I usually work directly with tensorflow, but I would try this.
Also note that initialize_all_variables is deprecated and you should use global_variables_initializer instead.
Recently, I'm working on a project "predicting future trajectories of objects from their past trajectories by using LSTMs in Tensorflow."
(Here, a trajectory means a sequence of 2D positions.)
Input to the LSTM is, of course, 'past trajectories' and output is 'future trajectories'.
The size of mini-batch is fixed when training. However, the number of past trajectories in a mini-batch can be different. For example, let the mini-batch size be 10. If I have only 4 past trajectories for the current training iteration, 6 out of 10 in the mini-batch is padded with zero value.
When calculating the loss for the back-propagation, I let the loss from the 6 be zero so that the only 4 contribute to the back-propagation.
The problem that I concern is..it seems that Tensorflow still calculates gradients for the 6 even if their loss is zero. As a result, the training speed becomes slower as I increase the mini-batch size even if I used the same training data.
I also used tf.where function when calculating the loss. However, the training time does not decrease.
How can I reduce the training time?
Here I attached my pseudo code for training.
# For each frame in a sequence
for f in range(pred_length):
# For each element in a batch
for b in range(batch_size):
with tf.variable_scope("rnnlm") as scope:
if (f > 0 or b > 0):
scope.reuse_variables()
# for each pedestrian in an element
for p in range(MNP):
# ground-truth position
cur_gt_pose = ...
# loss mask
loss_mask_ped = ... # '1' or '0'
# go through RNN decoder
output_states_dec_list[b][p], zero_states_dec_list[b][p] = cell_dec(cur_embed_frm_dec,
zero_states_dec_list[b][p])
# fully connected layer for output
cur_pred_pose_dec = tf.nn.xw_plus_b(output_states_dec_list[b][p], output_wd, output_bd)
# go through embedding function for the next input
prev_embed_frms_dec_list[b][p] = tf.reshape(tf.nn.relu(tf.nn.xw_plus_b(cur_pred_pose_dec, embedding_wd, embedding_bd)), shape=(1, rnn_size))
# calculate MSE loss
mse_loss = tf.reduce_sum(tf.pow(tf.subtract(cur_pred_pose_dec, cur_gt_pose_dec), 2.0))
# only valid ped's traj contributes to the loss
self.loss += tf.multiply(mse_loss, loss_mask_ped)
I think you're looking for the function tf.stop_gradient. Using this, you could do something like tf.where(loss_mask, tensor, tf.stop_gradient(tensor)) to achieve the desired result, assuming that the dimensions are correct.
However, it looks like this is probably not your issue. It seems as though for each item in your dataset, you are defining new graph nodes. This is not how TensorFlow is supposed to function, you should only have one graph, built beforehand that performs some fixed function, regardless of the batch size. You should definitely not be defining new nodes for every element in the batch, since that cannot efficiently take advantage of parallelism.
I'm making my first steps learning TF and have some trouble training RNNs.
My toy problem goes like this: a two layers LSTM + dense layer network is fed with raw audio data and should test whether a certain frequency is present in the sound.
so the network should 1 to 1 map float(audio data sequence) to float(pre-chosen frequency volume)
I've got this to work on Keras and seen a similar TFLearn solution but would like to implement this on bare Tensorflow in a relatively efficient way.
what i've done:
lstm = rnn_cell.BasicLSTMCell(LSTM_SIZE,state_is_tuple=True,forget_bias=1.0)
lstm = rnn_cell.DropoutWrapper(lstm)
stacked_lstm = rnn_cell.MultiRNNCell([lstm] * 2,state_is_tuple=True)
outputs, states = rnn.dynamic_rnn(stacked_lstm, in, dtype=tf.float32)
outputs = tf.transpose(outputs, [1, 0, 2])
last = tf.gather(outputs, int(outputs.get_shape()[0]) - 1)
network= tf.matmul(last, W) + b
# cost function, optimizer etc...
during training I fed this with (BATCH_SIZE, SEQUENCE_LEN,1) batches and it seems like the loss converged correctly but I can't figure out how to predict with the trained network.
My (awful lot of) questions:
how do i make this network return a sequence right from Tensorflow without going back to python for each sample(feed a sequence and predict a sequence of the same size)?
If I do want to predict one sample at a time and iterate in python what is the correct way to do it?
During testing is dynamic_rnn needed or it's just used for unrolling for BPTT during training? why is dynamic_rnn returning all the back propagation steps Tensors? these are the outputs of each layer of the unrolled network right?
after some research:
how do i make this network return a sequence right from Tensorflow
without going back to python for each sample(feed a sequence and
predict a sequence of the same size)?
you can use state_saving_rnn
class Saver():
def __init__(self):
self.d = {}
def state(self, name):
if not name in self.d:
return tf.zeros([1,LSTM_SIZE],tf.float32)
return self.d[name]
def save_state(self, name, val):
self.d[name] = val
return tf.identity('save_state_name') #<-important for control_dependencies
outputs, states = rnn.state_saving_rnn(stacked_lstm, inx, Saver(),
('lstmstate', 'lstmstate2', 'lstmstate3', 'lstmstate4'),sequence_length=[EVAL_SEQ_LEN])
#4 states are for two layers of lstm each has hidden and CEC variables to restore
network = [tf.matmul(outputs[-1], W) for i in xrange(EVAL_SEQ_LEN)]
one problem is that state_saving_rnn is using rnn() and not dynamic_rnn() therefore unroll at compile time EVAL_SEQ_LEN steps you might want to re-implement state_saving_rnn with dynamic_rnn if you want to input long sequences
If I do want to predict one sample at a time and iterate in python what is the correct way to do it?
you can use dynamic_rnn and supply initial_state. this is probably just as efficient as state_saving_rnn. look at state_saving_rnn implementations for reference
During testing is dynamic_rnn needed or it's just used for unrolling for BPTT during training? why is dynamic_rnn returning all the back propagation steps Tensors? these are the outputs of each layer of the unrolled network right?
dynamic_rnn does do unrolling at runtime similarly to compile time rnn(). I guess it returns all the steps for you to branch the graph in some other places - after less time steps. in a network that use [one time step input * current state -> one output, new state] like the one described above it's not needed in testing but could be used for training truncated time back propagation