I'm learning TensorFlow and trying to apply it to a simple linear regression problem. data is a numpy.ndarray of shape [42x2].
I'm a bit puzzled why the loss increases after each successive epoch. Isn't the loss expected to go down with each successive epoch?
Here is my code (let me know if you'd like me to share the output as well). Thanks a lot for taking the time to answer.
1) created the placeholders for dependent / independent variables
X = tf.placeholder(tf.float32, name='X')
Y = tf.placeholder(tf.float32,name='Y')
2) created vars for weight, bias, total_loss (after each epoch)
w = tf.Variable(0.0,name='weights')
b = tf.Variable(0.0,name='bias')
3) defined loss function & optimizer
Y_pred = X * w + b
loss = tf.reduce_sum(tf.square(Y - Y_pred), name = 'loss')
optimizer = tf.train.GradientDescentOptimizer(learning_rate = 0.001).minimize(loss)
4) created summary events & event file writer
tf.summary.scalar(name = 'weight', tensor = w)
tf.summary.scalar(name = 'bias', tensor = b)
tf.summary.scalar(name = 'loss', tensor = loss)
merged = tf.summary.merge_all()
evt_file = tf.summary.FileWriter('def_g')
evt_file.add_graph(tf.get_default_graph())
5) and execute all in a session
with tf.Session() as sess1:
    sess1.run(tf.variables_initializer(tf.global_variables()))
    for epoch in range(10):
        summary, _, l = sess1.run([merged, optimizer, loss], feed_dict={X: data[:,0], Y: data[:,1]})
        evt_file.add_summary(summary, epoch+1)
        evt_file.flush()
        print(" new_loss: {}".format(sess1.run(loss, feed_dict={X: data[:,0], Y: data[:,1]})))
Cheers!
The short answer is that your learning rate is too big. I was able to get reasonable results by changing it from 0.001 to 0.0001, but I only used the 23 points from your second-last comment (I initially didn't notice your last comment), so using all the data might require an even lower number.
0.001 seems like a really low learning rate. However, the real problem is that your loss function uses reduce_sum instead of reduce_mean. This makes your loss a large number, which sends a very strong signal to the GradientDescentOptimizer, so it overshoots despite the low learning rate. The problem would only get worse if you added more points to your training data. Use reduce_mean to get the average squared error and your algorithm will be much better behaved.
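For concreteness, a minimal sketch of that change against the graph in the question (same variable names as above; the learning rate may still need tuning for your data):
loss = tf.reduce_mean(tf.square(Y - Y_pred), name='loss')  # mean instead of sum, so the gradient no longer scales with the number of points
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss)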
(I am new to Stack Exchange, but I believe I have correctly classified this question. If there's something off about it, please let me know.)
I am trying to write a machine learning algorithm that learns to move an arm by contracting muscles. I have done my best to work out every possible bug I can think of, but I have come to an impasse where every individual part of the program seems to run correctly, yet the algorithm does not learn. Fundamentally, all this model is doing is finding the inverse of a function by training a neural network on said function's inputs and outputs. The only thing that makes it even remotely nontrivial is that it uses an intermediary function when calculating the loss.
Working in Python with TensorFlow, we first define some constants and a function that converts deltoid and bicep muscle contractions to hand positions:
lH = 1.0
lU = 1.0
lCD = 0.1
lHD = 0.1
lHB = 0.9
lUB = 0.1
lD_max = lCD+lHD
lD_min = abs(lCD-lHD)
lD_diff = lD_max-lD_min
lB_max = lHB+lUB
lB_min = abs(lHB-lUB)
lB_diff = lB_max-lB_min
max_muscle_contraction = 0.9
min_muscle_contraction = 0.1
lD_min_eff = lD_min + min_muscle_contraction*lD_diff
lD_max_eff = lD_min + max_muscle_contraction*lD_diff
lB_min_eff = lB_min + min_muscle_contraction*lB_diff
lB_max_eff = lB_min + max_muscle_contraction*lB_diff
def contractionToPosition(c):
    # Takes a (n, m, 2) tensor of contraction pairs and returns a (n, m, 2) tensor of the resulting positions
    # Commonly takes (n, 2, 2) contraction tensors: a vector of initial and final vectors of deltoid-tricep pairs.
    cosD = (lCD**2 + lHD**2 - tf.math.square(c[:,:,0]))/(2*lCD*lHD)
    cosD = tf.math.minimum(cosD, 2*max_muscle_contraction - 1)
    cosD = tf.math.maximum(cosD, 2*min_muscle_contraction - 1)  # Equivalent to limiting the contraction
    sinD = tf.math.sqrt(1 - tf.math.square(cosD))
    cosB = (lHB**2 + lUB**2 - tf.math.square(c[:,:,1]))/(2*lHB*lUB)
    cosB = tf.math.minimum(cosB, 2*max_muscle_contraction - 1)
    cosB = tf.math.maximum(cosB, 2*min_muscle_contraction - 1)  # Equivalent to limiting the contraction
    sinB = tf.math.sqrt(1 - tf.math.square(cosB))
    px = lH*cosD + lU*sinB*sinD - lU*cosB*cosD
    py = -lH*sinD + lU*sinB*cosD + lU*cosB*sinD
    p = tf.stack([px, py], axis=-1)  # By px[i,j] being the [i,j]th px value that must be paired with the [i,j]th py value
    return p
Regardless of the validity of the above values and function, the algorithm should still be able to learn from it, because the data itself is synthetically generated with this same function. This function is also what the neural network is (approximately) trying to invert. Note that the neural network should take in the initial position and the planned final position and return a change in the muscle contractions. Calculating the difference between the true final positions and the planned final positions will thus also require knowing the initial contraction. Toward this, we generate the synthetic data that we will later train the algorithm on:
def generateContraction(samples):  # Returns a random vector of contraction lengths
    cD = tf.zeros(samples)
    cD += tf.random.uniform(shape=cD.shape, minval=lD_min_eff, maxval=lD_max_eff)
    cB = tf.zeros(samples)
    cB += tf.random.uniform(shape=cB.shape, minval=lB_min_eff, maxval=lB_max_eff)
    return tf.transpose(tf.stack([cD, cB]))

def data(samples):
    ci = generateContraction(samples)
    cf = generateContraction(samples)
    c = tf.stack([ci, cf], axis=1)
    p = contractionToPosition(c)
    return p, c
sample_size = 10000
positions, contractions = data(sample_size)
initial_contractions = contractions[:,0]
final_contractions = contractions[:,1]
features = positions
labels = tf.subtract(final_contractions, initial_contractions)
initial_data = initial_contractions
I have meticulously tested the entire process of this data's construction and every step has proven accurate. We then load this raw data into a dataset for the learning algorithm,
def load_array(data_arrays, batch_size, is_train=True):
    dataset = tf.data.Dataset.from_tensor_slices(data_arrays)
    if is_train:
        dataset = dataset.shuffle(buffer_size=1000)
    dataset = dataset.batch(batch_size)
    return dataset
batch_size = 64
data_iter = load_array((features, labels, initial_data), batch_size)
The network model doesn't need to be very complicated to tell whether the learning algorithm works, since there is no statistical error in the data. We also intend this model to act like the neural network found in the cerebellum of mammals. Specifically, this implies that for simple motions it is a shallow sequential neural network with ReLU activation. As such, we construct it fairly simply:
net = tf.keras.Sequential([
    tf.keras.Input(shape=(2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(units=64,
                          activation='relu',
                          kernel_initializer=tf.keras.initializers.RandomNormal(stddev=0.1)),
    tf.keras.layers.Dense(units=2,
                          activation='relu',
                          kernel_initializer=tf.keras.initializers.RandomNormal(stddev=0.1))
])
Finally, we write our learning algorithm based on the TensorFlow documentation, https://www.tensorflow.org/guide/keras/writing_a_training_loop_from_scratch . Note that we are optimizing the squared distance between the planned and actual final positions rather than the difference in the contractions. This is the only thing that makes this ever so slightly nontrivial.
loss = tf.keras.losses.MeanSquaredError()
def train(net, train_iter, loss, epochs, lr):
    trainer = tf.keras.optimizers.Adam(learning_rate=lr)
    params = net.trainable_variables
    for epoch in range(epochs):
        epochError = 0
        for X, y, I in train_iter:
            with tf.GradientTape() as g:
                g.watch(params)
                P_hat = contractionToPosition(tf.reshape(net(X, training=True) + I, (-1, 1, 2)))
                P = contractionToPosition(tf.reshape(y + I, (-1, 1, 2)))  # We have to reshape because of our function contractionToPosition
                l = loss(P, P_hat)
            epochError += l
            error = l
            grads = g.gradient(l, params)
            trainer.apply_gradients(zip(grads, params))
        print(f'epoch {epoch + 1}, '
              f'loss: {epochError}')

train(net, data_iter, loss, 5, 0.05)
The result of all this, though, is a complete lack of learning. The epoch loss is usually about 109 (which is what you would expect with no learning), with no significant change in that loss (it usually fluctuates within +/-0.7). If anything is at fault, I would suspect this final code snippet, specifically the gradient tape. I have probed every aspect of the gradient tape, however, and everything seems to be functioning correctly. Overall, I cannot think of a part of my code I have not dissected at this point, so I am at a total loss here.
Any and all help is deeply appreciated!
I'm trying to run through a simple linear regression example in TensorFlow, and it appears that the training algorithm is converging to a solution, but once it gets close to the solution, it starts bouncing around and eventually blows up.
I'm passing data for a y = 2x line, so the gradient descent optimizer should be able to easily converge to a solution.
import tensorflow as tf
M = tf.Variable([0.4], dtype=tf.float32)
b = tf.Variable([-0.4], dtype=tf.float32)
x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)
linear_model = M * x + b
error = linear_model - y
loss = tf.square(error)
optimizer = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
with tf.Session() as sess:
    init = tf.global_variables_initializer()
    sess.run(init)
    for i in range(100):
        sess.run(optimizer, {x: i, y: 2 * i})
        print(sess.run([M, b]))
Here's the result (I circled the portion where it gets close to a solution). Why does gradient descent break down once it gets close to the solution, or is there something I'm doing wrong?
Your code feeds the training data one point at a time, for only one epoch. This corresponds to stochastic gradient descent, where the loss tends to fluctuate more during training than with batch or mini-batch gradient descent. Moreover, since the data is fed in increasing order of x, the gradient value also increases along with x. That is why you see larger fluctuations in the later part of the epoch.
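As a minimal, self-contained sketch of the mini-batch idea on the same y = 2x toy data (the loss is written with reduce_mean so the gradient does not scale with the batch size, and the learning rate and step counts are illustrative guesses):
import numpy as np
import tensorflow as tf

xs = np.arange(100, dtype=np.float32)
ys = 2 * xs

M = tf.Variable([0.4], dtype=tf.float32)
b = tf.Variable([-0.4], dtype=tf.float32)
x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)
loss = tf.reduce_mean(tf.square(M * x + b - y))
train_op = tf.train.GradientDescentOptimizer(0.0001).minimize(loss)

batch_size = 10
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(200):
        idx = np.random.permutation(len(xs))  # shuffle so large-x points are spread across batches
        for start in range(0, len(xs), batch_size):
            batch = idx[start:start + batch_size]
            sess.run(train_op, {x: xs[batch], y: ys[batch]})
    print(sess.run([M, b]))  # M should approach 2; b converges more slowly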
This can happen if the learning rate is too high; try lowering it.
My guess would be that you have chosen a high learning rate. You can use a grid search to find an optimal learning rate, then fit the data using it.
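A rough sketch of what such a search could look like on this toy problem (the candidate rates and step count are arbitrary; the graph is rebuilt for each rate, and the loss is averaged so the runs stay comparable):
import numpy as np
import tensorflow as tf

xs = np.arange(100, dtype=np.float32)
ys = 2 * xs

for lr in [1e-2, 1e-3, 1e-4, 1e-5]:
    tf.reset_default_graph()
    M = tf.Variable([0.4], dtype=tf.float32)
    b = tf.Variable([-0.4], dtype=tf.float32)
    x = tf.placeholder(tf.float32)
    y = tf.placeholder(tf.float32)
    loss = tf.reduce_mean(tf.square(M * x + b - y))
    train_op = tf.train.GradientDescentOptimizer(lr).minimize(loss)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for _ in range(500):
            sess.run(train_op, {x: xs, y: ys})
        # keep whichever rate gives the lowest final loss
        print(lr, sess.run(loss, {x: xs, y: ys}), sess.run([M, b]))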
I used TensorFlow to implement a simple RNN model to learn possible trends of time series data and predict future values. However, the model always produces the same values after training. In fact, the best model it arrives at is:
y = b.
The RNN structure is:
InputLayer -> BasicRNNCell -> Dense -> OutputLayer
RNN code:
def RNN(n_timesteps, n_input, n_output, n_units):
    tf.reset_default_graph()
    X = tf.placeholder(dtype=tf.float32, shape=[None, n_timesteps, n_input])
    cells = [tf.contrib.rnn.BasicRNNCell(num_units=n_units)]
    stacked_rnn = tf.contrib.rnn.MultiRNNCell(cells)
    stacked_output, states = tf.nn.dynamic_rnn(stacked_rnn, X, dtype=tf.float32)
    stacked_output = tf.layers.dense(stacked_output, n_output)
    return X, stacked_output
In training, n_timesteps=1, n_input=1, n_output=1, n_units=2, learning_rate=0.0000001, and the loss is the mean squared error.
The input is a sequence of data from consecutive days; the output is the data for the day following the input days.
(Maybe these are not good settings, but no matter how I change them the results are almost the same, so I just use these to help show the problem later.)
And I found out this is because the weights and bias of BasicRNNCell are not trained; they stay the same from the beginning. Only the weights and bias of the Dense layer keep changing. So in training I get predictions like these:
In the beginning:
loss: 1433683500.0
rnn/multi_rnn_cell/cell_0/cell0/kernel:0 [KEEP UNCHANGED]
rnn/multi_rnn_cell/cell_0/cell0/bias:0 [KEEP UNCHANGED]
dense/kernel:0 [CHANGING]
dense/bias:0 [CHANGING]
After a while:
loss: 175372340.0
rnn/multi_rnn_cell/cell_0/cell0/kernel:0 [KEEP UNCHANGED]
rnn/multi_rnn_cell/cell_0/cell0/bias:0 [KEEP UNCHANGED]
dense/kernel:0 [CHANGING]
dense/bias:0 [CHANGING]
The orange line indicates the true data, the blue line indicates the results of my code. Through training, the blue line keeps going up until the model reaches a stable loss.
To check whether I had made a mistake in the implementation, I generated a set of data from y = 10x + 5 for testing. This time, my model learns the correct results.
[Plots: model predictions at the beginning and at the end of training]
I have tried:
adding more layers of both BasicRNNCell and Dense
increasing the RNN cell hidden size (n_units) to 128
decreasing the learning_rate to 1e-10
increasing the timesteps to 60
None of them work.
So, my questions are:
Is it because my model is too simple? But I don't think the trend of my data is that complicated to learn; at least something like y = ax + b should produce a smaller loss than y = b.
What might be leading to these results?
Or how should I continue debugging?
And now I suspect that maybe BasicRNNCell is not fully implemented and users are expected to implement some of its functions themselves? I have no previous experience with TensorFlow.
It seems your net is just not fit for that kind of data, or, from another point of view, your data is badly scaled. When I add the 4 lines below after split_data, I get some sort of learning behavior, similar to the one with the a*x + b case:
data = read_data(work_dir, input_file)
plot_data(data)
input_data, output_data, n_batches = split_data(data, n_timesteps, n_input, n_output)
# scale input and output data
input_data = input_data-input_data[0]
input_data = input_data/np.max(input_data)*1000
output_data = output_data-output_data[0]
output_data = output_data/np.max(output_data)*1000
I'm trying to figure out how to decrease the error in my LSTM. It's an odd use case because rather than classifying, we are taking in short lists (up to 32 elements long) and outputting a series of real numbers ranging from -1 to 1, representing angles. Essentially, we want to reconstruct short protein loops from amino acid inputs.
In the past we had redundant data in our datasets, so the accuracy reported was incorrect. Since removing the redundant data, our validation accuracy has gotten much worse, which suggests our network had learned to memorise the most frequent examples.
Our dataset is 10,000 items, split 70/20/10 between train, validation and test. We use a bi-directional LSTM as follows:
x = tf.cast(tf_train_dataset, dtype=tf.float32)
output_size = FLAGS.max_cdr_length * 4
dmask = tf.placeholder(tf.float32, [None, output_size], name="dmask")
keep_prob = tf.placeholder(tf.float32, name="keepprob")
sizes = [FLAGS.lstm_size, int(math.floor(FLAGS.lstm_size/2)), int(math.floor(FLAGS.lstm_size/4))]
single_rnn_cell_fw = tf.contrib.rnn.MultiRNNCell( [lstm_cell(sizes[i], keep_prob, "cell_fw" + str(i)) for i in range(len(sizes))])
single_rnn_cell_bw = tf.contrib.rnn.MultiRNNCell( [lstm_cell(sizes[i], keep_prob, "cell_bw" + str(i)) for i in range(len(sizes))])
length = create_length(x)
initial_state = single_rnn_cell_fw.zero_state(FLAGS.batch_size, dtype=tf.float32)
initial_state = single_rnn_cell_bw.zero_state(FLAGS.batch_size, dtype=tf.float32)
outputs, states = tf.nn.bidirectional_dynamic_rnn(cell_fw=single_rnn_cell_fw, cell_bw=single_rnn_cell_bw, inputs=x, dtype=tf.float32, sequence_length = length)
output_fw, output_bw = outputs
states_fw, states_bw = states
output_fw = last_relevant(FLAGS, output_fw, length, "last_fw")
output_bw = last_relevant(FLAGS, output_bw, length, "last_bw")
output = tf.concat((output_fw, output_bw), axis=1, name='bidirectional_concat_outputs')
test = tf.placeholder(tf.float32, [None, output_size], name="train_test")
W_o = weight_variable([sizes[-1]*2, output_size], "weight_output")
b_o = bias_variable([output_size],"bias_output")
y_conv = tf.tanh( ( tf.matmul(output, W_o)) * dmask, name="output")
Essentially, we use 3 layers of LSTM, with 256, 128 and 64 units each. We take the last step of both the forward and backward passes and concatenate them together. These feed into a final, fully connected layer that presents the data in the way we need it. We use a mask to zero out the steps we don't need.
Our cost function uses a mask again and takes the mean of the squared difference. We build the mask from the test data; values to ignore are set to -3.0.
def cost(goutput, gtest, gweights, FLAGS):
    mask = tf.sign(tf.add(gtest, 3.0))
    basic_error = tf.square(gtest - goutput) * mask
    basic_error = tf.reduce_sum(basic_error)
    basic_error /= tf.reduce_sum(mask)
    return basic_error
To train the net I've used a variety of optimizers. The lowest scores have been obtained with the AdamOptimizer. The others, such as Adagrad, Adadelta and RMSProp, tend to flatline around 0.3/0.4 error, which is not particularly great.
Our learning rate is 0.004 with a batch size of 200. We use a dropout layer with probability 0.5.
I've tried adding more layers, changing learning rates, batch sizes, even the representation of the data. I've attempted batch regularisation, L1 and L2 weight regularisation (though perhaps incorrectly), and I've even considered switching to a convnet approach instead.
Nothing seems to make any difference. What has seemed to work is changing the optimizer. Adam seems noisier as it improves, but it does get closer than the other optimizers.
We need to get down to a value much closer to 0.05 or 0.01. Sometimes the training error touches 0.09, but the validation error doesn't follow. I've run this network for about 500 epochs so far (about 8 hours) and it tends to settle around 0.2 validation error.
I'm not quite sure what to attempt next. A decayed learning rate might help, but I suspect there is something more fundamental I need to do. It could be something as simple as a bug in the code - I need to double check the masking.
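If I do go the decayed-learning-rate route, I imagine something along these lines in the same TF1 style (a sketch only: the starting rate, decay_steps and decay_rate are illustrative guesses, and cost_op here stands for the scalar returned by the cost() function above):
global_step = tf.Variable(0, trainable=False, name="global_step")
decayed_lr = tf.train.exponential_decay(learning_rate=0.004,
                                        global_step=global_step,
                                        decay_steps=1000,
                                        decay_rate=0.96,
                                        staircase=True)
# passing global_step makes the optimizer increment it on every update,
# which is what drives the decay schedule
train_step = tf.train.AdamOptimizer(decayed_lr).minimize(cost_op, global_step=global_step)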
I'm running into a weird problem with TensorFlow. I've set up a very simple classification problem, four input variables, one binary output variable, one layer of weights and bias, output goes through a sigmoid to 0 or 1.
The problem is, memory consumption is quadratic in the number of records of training data! With only 5,000 records, it's already 900 megabytes; at 10,000, it runs into a few gigabytes. Since I want to end up using at least a few tens of thousands of records, this is a problem.
It is happening specifically in the back propagation step; when I just try to evaluate the cost function, memory consumption is linear in the number of records, as expected.
Code follows. What am I doing wrong?
import numpy as np
import os
import psutil
import tensorflow as tf
process = psutil.Process(os.getpid())
sess = tf.InteractiveSession()
# Parameters
learning_rate = 0.01
random_seed = 1
tf.set_random_seed(random_seed)
# Data
data = np.loadtxt('train.csv', delimiter=',', dtype=np.float32)
train_X = data[:, :-1]
train_Y = data[:, -1]
rows = np.shape(train_X)[0]
cols = np.shape(train_X)[1]
# Inputs and outputs
X = tf.placeholder(np.float32, shape=(rows, cols))
Y = tf.placeholder(np.float32, shape=rows,)
# Weights
W = tf.Variable(tf.random_normal((cols, 1)))
b = tf.Variable(tf.random_normal(()))
# Model
p = tf.nn.sigmoid(tf.matmul(X, W) + b)
cost = tf.reduce_sum((p-Y)**2/rows)
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
tf.global_variables_initializer().run()
# Just one optimizer step is enough to demonstrate the problem
optimizer.run({X: train_X, Y: train_Y})
# Memory consumption is quadratic in number of rows
print('{0:,} bytes'.format(process.memory_info().peak_wset))
It turns out to be, again, a problem of shape. Using matmul the way I did generates output of shape (n, 1). Using that in a context where shape (n,) was expected silently generates a quadratic blowup.
The solution is squeeze. Specifically, tf.squeeze(tf.matmul(X, W)).
It makes sense that memory consumption blows up like that, since backprop requires extra memory to keep track of the gradients of each operation (though I can't figure out how it ends up being quadratic).
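As a minimal sketch of the fix against the code above (same X, W, b, Y, rows and learning_rate as before):
# squeeze the (n, 1) matmul result down to shape (n,) so it matches the (n,) labels
p = tf.nn.sigmoid(tf.squeeze(tf.matmul(X, W)) + b)
cost = tf.reduce_sum((p - Y)**2 / rows)
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)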
Solution: mini-batches
This is usually the go-to method when it comes to training models. Split your training data into small mini-batches, each containing a fixed number of samples (rarely more than 200), and feed them to the optimizer one mini-batch at a time. So if batch_size=64, then the train_X and train_Y fed to the optimizer will have shapes (64, 4) and (64,) respectively.
I would try something like this (note that the X and Y placeholders would also need shapes (None, cols) and (None,) so they can accept variable batch sizes):
batch_size = 64
n_batches = int(np.ceil(rows / batch_size))
for i in range(n_batches):
    batch_X = train_X[i*batch_size : (i + 1)*batch_size]
    batch_Y = train_Y[i*batch_size : (i + 1)*batch_size]
    optimizer.run({X: batch_X, Y: batch_Y})