Here is what I'm trying to implement:
We calculate loss based on F(X), as usual. But we also define "adversarial loss" which is a loss based on F(X + e). e is defined as dF(X)/dX multiplied by some constant. Both loss and adversarial loss are backpropagated for the total loss.
In tensorflow, this part (getting dF(X)/dX) can be coded like below:
grad, = tf.gradients( loss, X )
grad = tf.stop_gradient(grad)
e = constant * grad
Below is my pytorch code:
class DocReaderModel(object):
def __init__(self, embedding=None, state_dict=None):
self.train_loss = AverageMeter()
self.embedding = embedding
self.network = DNetwork(opt, embedding)
self.optimizer = optim.SGD(parameters)
def adversarial_loss(self, batch, loss, embedding, y):
self.optimizer.zero_grad()
loss.backward(retain_graph=True)
grad = embedding.grad
grad.detach_()
perturb = F.normalize(grad, p=2)* 0.5
self.optimizer.zero_grad()
adv_embedding = embedding + perturb
network_temp = DNetwork(self.opt, adv_embedding) # This is how to get F(X)
network_temp.training = False
network_temp.cuda()
start, end, _ = network_temp(batch) # This is how to get F(X)
del network_temp # I even deleted this instance.
return F.cross_entropy(start, y[0]) + F.cross_entropy(end, y[1])
def update(self, batch):
self.network.train()
start, end, pred = self.network(batch)
loss = F.cross_entropy(start, y[0]) + F.cross_entropy(end, y[1])
loss_adv = self.adversarial_loss(batch, loss, self.network.lexicon_encoder.embedding.weight, y)
loss_total = loss + loss_adv
self.optimizer.zero_grad()
loss_total.backward()
self.optimizer.step()
I have few questions:
1) I substituted tf.stop_gradient with grad.detach_(). Is this correct?
2) I was getting "RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time." so I added retain_graph=True at the loss.backward. That specific error went away.
However now I'm getting a memory error after few epochs (RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1525909934016/work/aten/src/THC/generic/THCStorage.cu:58
). I suspect I'm unnecessarily retaining graph.
Can someone let me know pytorch's best practice on this? Any hint / even short comment will be highly appreciated.
I think you are trying to implement generative adversarial network (GAN), but from the code, I don't understand and can't follow to what you are trying to achieve as there are a few missing pieces for a GAN to works. I can see there's a discriminator network module, DNetwork but missing the generator network module.
If to guess, when you say 'loss function twice', I assumed you mean you have one loss function for the discriminator net and another for the generator net. If that's the case, let me share how I would implement a basic GAN model.
As an example, let's take a look at this Wasserstein GAN Jupyter notebook
I'll skip the less important bits and zoom into the important ones here:
First, import PyTorch libraries and set up
# Set up batch size, image size, and size of noise vector:
bs, sz, nz = 64, 64, 100 # nz is the size of the latent z vector for creating some random noise later
Build a discriminator module
class DCGAN_D(nn.Module):
def __init__(self):
... truncated, the usual neural nets stuffs, layers, etc ...
def forward(self, input):
... truncated, the usual neural nets stuffs, layers, etc ...
Build a generator module
class DCGAN_G(nn.Module):
def __init__(self):
... truncated, the usual neural nets stuffs, layers, etc ...
def forward(self, input):
... truncated, the usual neural nets stuffs, layers, etc ...
Put them all together
netG = DCGAN_G().cuda()
netD = DCGAN_D().cuda()
Optimizer needs to be told what variables to optimize. A module automatically keeps track of its variables.
optimizerD = optim.RMSprop(netD.parameters(), lr = 1e-4)
optimizerG = optim.RMSprop(netG.parameters(), lr = 1e-4)
One forward step and one backward step for Discriminator
Here, the network can calculate gradient during the backward pass, depends on the input to this function. So, in my case, I have 3 type of losses; generator loss, dicriminator real image loss, dicriminator fake image loss. I can get gradient of loss function three times for 3 different net passes.
def step_D(input, init_grad):
# input can be from generator's generated image data or input image from dataset
err = netD(input)
err.backward(init_grad) # backward pass net to calculate gradient
return err # loss
Control trainable parameters [IMPORTANT]
Trainable parameters in the model are those that require gradients.
def make_trainable(net, val):
for p in net.parameters():
p.requires_grad = val # note, i.e, this is later set to False below in netG update in the train loop.
In TensorFlow, this part can be coded like below:
grad = tf.gradients(loss, X)
grad = tf.stop_gradient(grad)
So, I think this will answer your first question, "I substituted tf.stop_gradient with grad.detach_(). Is this correct?"
Train loop
You can see here how's the 3 different loss functions are being called here.
def train(niter, first=True):
for epoch in range(niter):
# Make iterable from PyTorch DataLoader
data_iter = iter(dataloader)
i = 0
while i < n:
###########################
# (1) Update D network
###########################
make_trainable(netD, True)
# train the discriminator d_iters times
d_iters = 100
j = 0
while j < d_iters and i < n:
j += 1
i += 1
# clamp parameters to a cube
for p in netD.parameters():
p.data.clamp_(-0.01, 0.01)
data = next(data_iter)
##### train with real #####
real_cpu, _ = data
real_cpu = real_cpu.cuda()
real = Variable( data[0].cuda() )
netD.zero_grad()
# Real image discriminator loss
errD_real = step_D(real, one)
##### train with fake #####
fake = netG(create_noise(real.size()[0]))
input.data.resize_(real.size()).copy_(fake.data)
# Fake image discriminator loss
errD_fake = step_D(input, mone)
# Discriminator loss
errD = errD_real - errD_fake
optimizerD.step()
###########################
# (2) Update G network
###########################
make_trainable(netD, False)
netG.zero_grad()
# Generator loss
errG = step_D(netG(create_noise(bs)), one)
optimizerG.step()
print('[%d/%d][%d/%d] Loss_D: %f Loss_G: %f Loss_D_real: %f Loss_D_fake %f'
% (epoch, niter, i, n,
errD.data[0], errG.data[0], errD_real.data[0], errD_fake.data[0]))
"I was getting "RuntimeError: Trying to backward through the graph a second time..."
PyTorch has this behaviour; to reduce GPU memory usage, during the .backward() call, all the intermediary results (if you have like saved activations, etc.) are deleted when they are not needed anymore. Therefore, if you try to call .backward() again, the intermediary results don't exist and the backward pass cannot be performed (and you get the error you see).
It depends on what you are trying to do. You can call .backward(retain_graph=True) to make a backward pass that will not delete intermediary results, and so you will be able to call .backward() again. All but the last call to backward should have the retain_graph=True option.
Can someone let me know pytorch's best practice on this
As you can see from the PyTorch code above and from the way things are being done in PyTorch which is trying to stay Pythonic, you can get a sense of PyTorch's best practice there.
If you want to work with higher-order derivatives (i.e. a derivative of a derivative) take a look at the create_graph option of backward.
For example:
loss = get_loss()
loss.backward(create_graph=True)
loss_grad_penalty = loss + loss.grad
loss_grad_penalty.backward()
Related
I am trying to find out, how exactly does BatchNormalization layer behave in TensorFlow. I came up with the following piece of code which to the best of my knowledge should be a perfectly valid keras model, however the mean and variance of BatchNormalization doesn't appear to be updated.
From docs https://www.tensorflow.org/api_docs/python/tf/keras/layers/BatchNormalization
in the case of the BatchNormalization layer, setting trainable = False on the layer means that the layer will be subsequently run in inference mode (meaning that it will use the moving mean and the moving variance to normalize the current batch, rather than using the mean and variance of the current batch).
I expect the model to return a different value with each subsequent predict call.
What I see, however, are the exact same values returned 10 times.
Can anyone explain to me why does the BatchNormalization layer not update its internal values?
import tensorflow as tf
import numpy as np
if __name__ == '__main__':
np.random.seed(1)
x = np.random.randn(3, 5) * 5 + 0.3
bn = tf.keras.layers.BatchNormalization(trainable=False, epsilon=1e-9)
z = input = tf.keras.layers.Input([5])
z = bn(z)
model = tf.keras.Model(inputs=input, outputs=z)
for i in range(10):
print(x)
print(model.predict(x))
print()
I use TensorFlow 2.1.0
Okay, I found the mistake in my assumptions. The moving average is being updated during training not during inference as I thought. This makes perfect sense, as updating the moving averages during inference would likely result in an unstable production model (for example a long sequence of highly pathological input samples [e.g. such that their generating distribution differs drastically from the one on which the network was trained] could potentially bias the network and result in worse performance on valid input samples).
The trainable parameter is useful when you're fine-tuning a pretrained model and want to freeze some of the layers of the network even during training. Because when you call model.predict(x) (or even model(x) or model(x, training=False)), the layer automatically uses the moving averages instead of batch averages.
The code below demonstrates this clearly
import tensorflow as tf
import numpy as np
if __name__ == '__main__':
np.random.seed(1)
x = np.random.randn(10, 5) * 5 + 0.3
z = input = tf.keras.layers.Input([5])
z = tf.keras.layers.BatchNormalization(trainable=True, epsilon=1e-9, momentum=0.99)(z)
model = tf.keras.Model(inputs=input, outputs=z)
# a dummy loss function
model.compile(loss=lambda x, y: (x - y) ** 2)
# a dummy fit just to update the batchnorm moving averages
model.fit(x, x, batch_size=3, epochs=10)
# first predict uses the moving averages from training
pred = model(x).numpy()
print(pred.mean(axis=0))
print(pred.var(axis=0))
print()
# outputs the same thing as previous predict
pred = model(x).numpy()
print(pred.mean(axis=0))
print(pred.var(axis=0))
print()
# here calling the model with training=True results in update of moving averages
# furthermore, it uses the batch mean and variance as in training,
# so the result is very different
pred = model(x, training=True).numpy()
print(pred.mean(axis=0))
print(pred.var(axis=0))
print()
# here we see again that the moving averages are used but they differ slightly after
# the previous call, as expected
pred = model(x).numpy()
print(pred.mean(axis=0))
print(pred.var(axis=0))
print()
In the end, I found that the documentation (https://www.tensorflow.org/api_docs/python/tf/keras/layers/BatchNormalization) mentions this:
When performing inference using a model containing batch normalization, it is generally (though not always) desirable to use accumulated statistics rather than mini-batch statistics. This is accomplished by passing training=False when calling the model, or using model.predict.
Hopefully this will help someone with similar misunderstanding in the future.
I am building a deep regression network (CNN) to predict a (1000,1) target vector from images (7,11). The target usually consists of about 90 % zeros and only 10 % non-zero values. The distribution of (non-) zero values in the targets vary from sample to sample (i.e. there is no global class imbalance).
Using mean sqaured error loss, this led to the network predicting only zeros, which I don't find surprising.
My best guess is to write a custom loss function that penalizes errors regarding non-zero values more than the prediction of zero-values.
I have tried this loss function with the intend to implement what I have guessed could work above. It is a mean squared error loss in which the predictions of non-zero targets are penalized less (w=0.1).
def my_loss(y_true, y_pred):
# weights true zero predictions less than true nonzero predictions
w = 0.1
y_pred_of_nonzeros = tf.where(tf.equal(y_true, 0), y_pred-y_pred, y_pred)
return K.mean(K.square(y_true-y_pred_of_nonzeros)) + K.mean(K.square(y_true-y_pred))*w
The network is able to learn without getting stuck with only-zero predictions. However, this solution seems quite unclean. Is there a better way to deal with this type of problem? Any advice on improving the custom loss function?
Any suggestions are welcome, thank you in advance!
Best,
Lukas
Not sure there is anything better than a custom loss just like you did, but there is a cleaner way:
def weightedLoss(w):
def loss(true, pred):
error = K.square(true - pred)
error = K.switch(K.equal(true, 0), w * error , error)
return error
return loss
You may also return K.mean(error), but without mean you can still profit from other Keras options like adding sample weights and other things.
Select the weight when compiling:
model.compile(loss = weightedLoss(0.1), ...)
If you have the entire data in an array, you can do:
w = K.mean(y_train)
w = w / (1 - w) #this line compesates the lack of the 90% weights for class 1
Another solution that can avoid using a custom loss, but requires changes in the data and the model is:
Transform your y into a 2-class problem for each output. Shape = (batch, originalClasses, 2).
For the zero values, make the first of the two classes = 1
For the one values, make the second of the two classes = 1
newY = np.stack([1-oldY, oldY], axis=-1)
Adjust the model to output this new shape.
...
model.add(Dense(2*classes))
model.add(Reshape((classes,2)))
model.add(Activation('softmax'))
Make sure you are using a softmax and a categorical_crossentropy as loss.
Then use the argument class_weight={0: w, 1: 1} in fit.
I'm trying to do Stanfords CS20: TensorFlow for Deep Learning Research course. The first 2 lectures provide a good introduction to the low level plumbing and computation framework (that frankly the official introductory tutorials seem to skip right over for reasons I can only fathom as sadism). In lecture 3, it starts performing a linear regression and makes what seems like a fairly heavy cognitive leap for me. Instead of session.run on a tensor computation, it does it on the GradientDescentOptimizer.
sess.run(optimizer, feed_dict={X: x, Y:y})
The full code is available on page 3 of the lecture 3 notes.
EDIT: code and data also available at this github - code is available in examples/03_linreg_placeholder.py and data in examples/data/birth_life_2010.txt
EDIT: code is below as per request
import tensorflow as tf
import utils
DATA_FILE = "data/birth_life_2010.f[txt"
# Step 1: read in data from the .txt file
# data is a numpy array of shape (190, 2), each row is a datapoint
data, n_samples = utils.read_birth_life_data(DATA_FILE)
# Step 2: create placeholders for X (birth rate) and Y (life expectancy)
X = tf.placeholder(tf.float32, name='X')
Y = tf.placeholder(tf.float32, name='Y')
# Step 3: create weight and bias, initialized to 0
w = tf.get_variable('weights', initializer=tf.constant(0.0))
b = tf.get_variable('bias', initializer=tf.constant(0.0))
# Step 4: construct model to predict Y (life expectancy from birth rate)
Y_predicted = w * X + b
# Step 5: use the square error as the loss function
loss = tf.square(Y - Y_predicted, name='loss')
# Step 6: using gradient descent with learning rate of 0.01 to minimize loss
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss)
with tf.Session() as sess:
# Step 7: initialize the necessary variables, in this case, w and b
sess.run(tf.global_variables_initializer())
# Step 8: train the model
for i in range(100): # run 100 epochs
for x, y in data:
# Session runs train_op to minimize loss
sess.run(optimizer, feed_dict={X: x, Y:y})
# Step 9: output the values of w and b
w_out, b_out = sess.run([w, b])
I've done the coursera machine learning course course so I (think) I understand the notion of Gradient Descent. But I'm quite lost as to what is happening in this specific case.
What I would expect to have to happen:
Calculate the gradient (either by calculus or numerical methods)
Calculate the parameter change (alpha multiplied by the predicted vs actual over the entire dataset)
Adjust the parameters
Repeat the above N times (in this case 100 times for 100 epochs)
I understand that in practice you'd apply things like batching and subsets but in this case I believe this is just looping over the entire dataset 100 times.
I can (and have) implemented this before. But I'm struggling to fathom how the code above could be achieving this. For one thing is the optimizer is called on each data point (i.e. it's in an inner loop of the 100 epochs and then each data point). I would have expected an optimization call which took in the entire dataset.
Question 1 - is the gradient adjustment operating over the entire data set 100 times, or over the entire data set 100 times in batches of 1 (so 100*n times, for n examples)?
Question 2 - how does the optimizer 'know' how to to adjust w and b? It's only provided the loss tensor - is it reading back through the graph and just going "well, w and b are the only variables, so I'll wiggle the hell out of those"
Question 2b - if so, what happens if you put in other variables? Or more complex functions? Does it just auto-magically calculate gradient adjustment for every variable in the predecessor graph**
Question 2c - pursuant to that I've tried adjusting to a quadratic expression as suggested in page 3 of the tutorial but end up getting a higher loss. Is this normal? The tutorial seems to suggest it should be better. At the least I would expect it not to be worse - is this subject to changing hyperparameters?
EDIT: Full code for my attempts to adjust to quadratic are here. Not that this is the same as the above with lines 28, 29, 30 and 34 modified to use a quadratic predictor. These edits are (what I interpret) to be what's suggested in the lecture 3 notes on page 4
""" Solution for simple linear regression example using placeholders
Created by Chip Huyen (chiphuyen#cs.stanford.edu)
CS20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu
Lecture 03
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
import time
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import utils
DATA_FILE = 'data/birth_life_2010.txt'
# Step 1: read in data from the .txt file
data, n_samples = utils.read_birth_life_data(DATA_FILE)
# Step 2: create placeholders for X (birth rate) and Y (life expectancy)
X = tf.placeholder(tf.float32, name='X')
Y = tf.placeholder(tf.float32, name='Y')
# Step 3: create weight and bias, initialized to 0
# w = tf.get_variable('weights', initializer=tf.constant(0.0)) old single weight
w = tf.get_variable('weights_1', initializer=tf.constant(0.0))
u = tf.get_variable('weights_2', initializer=tf.constant(0.0))
b = tf.get_variable('bias', initializer=tf.constant(0.0))
# Step 4: build model to predict Y
#Y_predicted = w * X + b #linear
Y_predicted = w * X * X + X * u + b #quadratic
#Y_predicted = w # test of nonsense
# Step 5: use the squared error as the loss function
# you can use either mean squared error or Huber loss
loss = tf.square(Y - Y_predicted, name='loss')
#loss = utils.huber_loss(Y, Y_predicted)
# Step 6: using gradient descent with learning rate of 0.001 to minimize loss
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss)
start = time.time()
writer = tf.summary.FileWriter('./graphs/linear_reg', tf.get_default_graph())
with tf.Session() as sess:
# Step 7: initialize the necessary variables, in this case, w and b
sess.run(tf.global_variables_initializer())
# Step 8: train the model for 100 epochs
for i in range(100):
total_loss = 0
for x, y in data:
# Session execute optimizer and fetch values of loss
_, l = sess.run([optimizer, loss], feed_dict={X: x, Y:y})
total_loss += l
print('Epoch {0}: {1}'.format(i, total_loss/n_samples))
# close the writer when you're done using it
writer.close()
# Step 9: output the values of w and b
w_out, b_out = sess.run([w, b])
print('Took: %f seconds' %(time.time() - start))
print(f'w = {w_out}')
# plot the results
plt.plot(data[:,0], data[:,1], 'bo', label='Real data')
plt.plot(data[:,0], data[:,0] * w_out + b_out, 'r', label='Predicted data')
plt.legend()
plt.show()
For the linear predictor I get loss of (this aligns with lecture notes):
Epoch 99: 30.03552558278714
For my attempts at the quadratic I get loss of:
Epoch 99: 127.2992221294363
In the code you linked, it's 100 epochs in batches of 1 (assuming each element of data is a single input). I.e. compute the gradient of the loss with respect to a single example, update the parameters, go to the next example... until you went over the whole dataset. Do this 100 times.
A lot of things happen in that minimize call of the optimizer. Indeed, you only put in the cost: Under the hood, Tensorflow will then compute gradients for all requested variables (we'll get to that in a second) that are involved in the cost computation (it can infer this from the computational graph) and return an op that "applies" the gradients. This means an op that takes all the requested variables and assigns a new value to them, something like tf.assign(var, var - learning_rate*gradient). This is related to another question you asked: minimize returns just an op, this doesn't do anything! Running this op in a session will do a "gradient step" each time.
As to which variables are actually affected by this op: You can give this as an argument to the minimize call! See here -- the argument is var_list. If this is not given, Tensorflow will simply use all "trainable variables". By default, any variable you create with tf.Variable or tf.get_variable is trainable. However you can pass trainable=False to these functions to create variables that are not (by default) going to be affected by the op returned by minimize. Play around with this! See what happens if you set some variables not to be trainable, or if you pass a custom var_list to minimize.
In general, the "whole idea" of Tensorflow is that it can "magically" calculate gradients based on only a feedforward description of the model.
EDIT: This is possible because machine learning models (including deep learning) are composed of quite simple building blocks such as matrix multiplications and mostly pointwise nonlinearities. These simple blocks also have simple derivatives, which can be composed via the chain rule. You might want to read up on the backpropagation algorithm.
It will certainly take longer with very large models. But it is always possible as long as there is a clear "path" through the computation graph where all components have defined derivatives.
As to whether this can generate poor models: Yes, and this is a fundamental problem of deep learning. Very complex/deep models lead to highly non-convex cost functions which are difficult to optimize with methods like gradient descent.
With regards to the quadratic function: Looks like there are two problems here.
Not enough training epochs. More complex problems (in this case, we have more variables) might simply need longer to train. E.g. with your setup I can reach a cost of ~58 after about 330 epochs with the quadratic function.
The learning rate. The above is still suspicious since with more variables we should definitely be able to reach better results (as long as the inputs for those variables aren't superfluous), and since this is a simple linear regression problem gradient descent should be able to find them. In this case the learning rate is usually the problem. I changed it to 0.0001 (lowered by a factor of 10) and after about 3400 epochs reach costs below 30 (haven't tested how low it goes). Now obviously lower learning rates lead to slower training, but they are often necessary towards the end to avoid "jumping over" better solutions. This is why in practice, some kind of learning rate annealing is usually performed -- start with a large learning rate to make rapid progress in the beginning, then make it smaller and smaller as training progresses. In general, the learning rate (and its annealing schedule) is the hyperparameter that needs the most tuning in machine learning problems.
There are also methods such as Adam that use an "adaptive" learning rate. Generally an untuned adaptive method will outperform an untuned gradient descent, so they are good for quick experiments. However, well-tuned gradient descent will usually outperform them in turn.
I'm trying to create a simple neural net in tensorflow that learns some simple relationship between inputs and outputs (for example, y=-x) where the inputs and outputs are floating point values (meaning, no softmax used on the output).
I feel like this should be pretty easy to do, but I must be messing up somewhere. Wondering if there are any tutorials or examples out there that do something similar. I looked through the existing tensorflow tutorials and didn't see anything like this and looked through several other sources of tensorflow examples I found by googling, but didn't see what I was looking for.
Here's a trimmed down version of what I've been trying. In this particular version, I've noticed that my weights and biases always seem to be stuck at zero. Perhaps this is due to my single input and single output?
I've had good luck altering the mist example for various nefarious purposes, but everything I've gotten to work successfully used softmax on the output for categorization. If I can figure out how to generate a raw floating point output from my neural net, there are several fun projects I'd like to do with it.
Anyone see what I'm missing? Thanks in advance!
- J.
# Trying to define the simplest possible neural net where the output layer of the neural net is a single
# neuron with a "continuous" (a.k.a floating point) output. I want the neural net to output a continuous
# value based off one or more continuous inputs. My real problem is more complex, but this is the simplest
# representation of it for explaining my issue. Even though I've oversimplified this to look like a simple
# linear regression problem (y=m*x), I want to apply this to more complex neural nets. But if I can't get
# it working with this simple problem, then I won't get it working for anything more complex.
import tensorflow as tf
import random
import numpy as np
INPUT_DIMENSION = 1
OUTPUT_DIMENSION = 1
TRAINING_RUNS = 100
BATCH_SIZE = 10000
VERF_SIZE = 1
# Generate two arrays, the first array being the inputs that need trained on, and the second array containing outputs.
def generate_test_point():
x = random.uniform(-8, 8)
# To keep it simple, output is just -x.
out = -x
return ( np.array([ x ]), np.array([ out ]) )
# Generate a bunch of data points and then package them up in the array format needed by
# tensorflow
def generate_batch_data( num ):
xs = []
ys = []
for i in range(num):
x, y = generate_test_point()
xs.append( x )
ys.append( y )
return (np.array(xs), np.array(ys) )
# Define a single-layer neural net. Originally based off the tensorflow mnist for beginners tutorial
# Create a placeholder for our input variable
x = tf.placeholder(tf.float32, [None, INPUT_DIMENSION])
# Create variables for our neural net weights and bias
W = tf.Variable(tf.zeros([INPUT_DIMENSION, OUTPUT_DIMENSION]))
b = tf.Variable(tf.zeros([OUTPUT_DIMENSION]))
# Define the neural net. Note that since I'm not trying to classify digits as in the tensorflow mnist
# tutorial, I have removed the softmax op. My expectation is that 'net' will return a floating point
# value.
net = tf.matmul(x, W) + b
# Create a placeholder for the expected result during training
expected = tf.placeholder(tf.float32, [None, OUTPUT_DIMENSION])
# Same training as used in mnist example
cross_entropy = -tf.reduce_sum(expected*tf.log(tf.clip_by_value(net,1e-10,1.0)))
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)
sess = tf.Session()
init = tf.initialize_all_variables()
sess.run(init)
# Perform our training runs
for i in range( TRAINING_RUNS ):
print "trainin run: ", i,
batch_inputs, batch_outputs = generate_batch_data( BATCH_SIZE )
# I've found that my weights and bias values are always zero after training, and I'm not sure why.
sess.run( train_step, feed_dict={x: batch_inputs, expected: batch_outputs})
# Test our accuracy as we train... I am defining my accuracy as the error between what I
# expected and the actual output of the neural net.
#accuracy = tf.reduce_mean(tf.sub( expected, net))
accuracy = tf.sub( expected, net) # using just subtract since I made my verification size 1 for debug
# Uncomment this to debug
#import pdb; pdb.set_trace()
batch_inputs, batch_outputs = generate_batch_data( VERF_SIZE )
result = sess.run(accuracy, feed_dict={x: batch_inputs, expected: batch_outputs})
print " progress: "
print " inputs: ", batch_inputs
print " outputs:", batch_outputs
print " actual: ", result
Your loss should be the squared difference of output and true value:
loss = tf.reduce_mean(tf.square(expected - net))
This way the network learns to optimize this loss and make the output closer to the real result. Cross entropy should only be used for output values between 0 and 1 i.e. for classification.
If anyone is interested, I got this example to work. Here's the code:
# Trying to define the simplest possible neural net where the output layer of the neural net is a single
# neuron with a "continuous" (a.k.a floating point) output. I want the neural net to output a continuous
# value based off one or more continuous inputs. My real problem is more complex, but this is the simplest
# representation of it for explaining my issue. Even though I've oversimplified this to look like a simple
# linear regression problem (y=m*x), I want to apply this to more complex neural nets. But if I can't get
# it working with this simple problem, then I won't get it working for anything more complex.
import tensorflow as tf
import random
import numpy as np
INPUT_DIMENSION = 1
OUTPUT_DIMENSION = 1
TRAINING_RUNS = 100
BATCH_SIZE = 10000
VERF_SIZE = 1
# Generate two arrays, the first array being the inputs that need trained on, and the second array containing outputs.
def generate_test_point():
x = random.uniform(-8, 8)
# To keep it simple, output is just -x.
out = -x
return (np.array([x]), np.array([out]))
# Generate a bunch of data points and then package them up in the array format needed by
# tensorflow
def generate_batch_data(num):
xs = []
ys = []
for i in range(num):
x, y = generate_test_point()
xs.append(x)
ys.append(y)
return (np.array(xs), np.array(ys))
# Define a single-layer neural net. Originally based off the tensorflow mnist for beginners tutorial
# Create a placeholder for our input variable
x = tf.placeholder(tf.float32, [None, INPUT_DIMENSION])
# Create variables for our neural net weights and bias
W = tf.Variable(tf.zeros([INPUT_DIMENSION, OUTPUT_DIMENSION]))
b = tf.Variable(tf.zeros([OUTPUT_DIMENSION]))
# Define the neural net. Note that since I'm not trying to classify digits as in the tensorflow mnist
# tutorial, I have removed the softmax op. My expectation is that 'net' will return a floating point
# value.
net = tf.matmul(x, W) + b
# Create a placeholder for the expected result during training
expected = tf.placeholder(tf.float32, [None, OUTPUT_DIMENSION])
# Same training as used in mnist example
loss = tf.reduce_mean(tf.square(expected - net))
# cross_entropy = -tf.reduce_sum(expected*tf.log(tf.clip_by_value(net,1e-10,1.0)))
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
sess = tf.Session()
init = tf.initialize_all_variables()
sess.run(init)
# Perform our training runs
for i in range(TRAINING_RUNS):
print("trainin run: ", i, )
batch_inputs, batch_outputs = generate_batch_data(BATCH_SIZE)
# I've found that my weights and bias values are always zero after training, and I'm not sure why.
sess.run(train_step, feed_dict={x: batch_inputs, expected: batch_outputs})
# Test our accuracy as we train... I am defining my accuracy as the error between what I
# expected and the actual output of the neural net.
# accuracy = tf.reduce_mean(tf.sub( expected, net))
accuracy = tf.subtract(expected, net) # using just subtract since I made my verification size 1 for debug
# tf.subtract()
# Uncomment this to debug
# import pdb; pdb.set_trace()
print("W=%f, b=%f" % (sess.run(W), sess.run(b)))
batch_inputs, batch_outputs = generate_batch_data(VERF_SIZE)
result = sess.run(accuracy, feed_dict={x: batch_inputs, expected: batch_outputs})
print(" progress: ")
print(" inputs: ", batch_inputs)
print(" outputs:", batch_outputs)
print(" actual: ", result)
When using the built in, easy way of constructing the NN, I used
loss=tf.keras.losses.MeanSquaredError().
I went through the code and I'm afraid I don't grasp an important point.
I can't seem to find the weights matrix of the model for the encoder and decoder, neither where they are updated. I found the target_weights but it seems to be reinitialized at every get_batch() call so I don't really understand what they stand for either.
My actual goal is to concatenate two hidden states of two source encoders for one decoder by applying a linear transformation with a weight matrix that I'll have to train along with the model (I'm building a manytoone model), but I have no idea where to start because of my problem mentionned above.
This might help you start. There are a couple of models implemented in tensorflow.python.ops.seq2seq.py (with/without buckets, attention, etc.) but take a look at the definition for embedding_attention_seq2seq (which is the one called in their example model seq2seq_model.py that you seem to be referencing):
def embedding_attention_seq2seq(encoder_inputs, decoder_inputs, cell,
num_encoder_symbols, num_decoder_symbols,
num_heads=1, output_projection=None,
feed_previous=False, dtype=dtypes.float32,
scope=None, initial_state_attention=False):
with variable_scope.variable_scope(scope or "embedding_attention_seq2seq"):
# Encoder.
encoder_cell = rnn_cell.EmbeddingWrapper(cell, num_encoder_symbols)
encoder_outputs, encoder_state = rnn.rnn(
encoder_cell, encoder_inputs, dtype=dtype)
# First calculate a concatenation of encoder outputs to put attention on.
top_states = [array_ops.reshape(e, [-1, 1, cell.output_size])
for e in encoder_outputs]
attention_states = array_ops.concat(1, top_states)
....
You can see where it picks out the top layer of encoder outputs as top_states before handing them off to the decoder.
So you could implement a similar function with two encoders and concatenate those states before handing off to the decoder.
The value created in the get_batch function is only used for the first iteration. Even though the weights are passed every time into the function, their value gets updated as a global variable in the Seq2Seq model class in the init function.
with tf.name_scope('Optimizer'):
# Gradients and SGD update operation for training the model.
params = tf.trainable_variables()
if not forward_only:
self.gradient_norms = []
self.updates = []
opt = tf.train.GradientDescentOptimizer(self.learning_rate)
for b in range(len(buckets)):
gradients = tf.gradients(self.losses[b], params)
clipped_gradients, norm = tf.clip_by_global_norm(gradients,
max_gradient_norm)
self.gradient_norms.append(norm)
self.updates.append(opt.apply_gradients(
zip(clipped_gradients, params), global_step=self.global_step))
self.saver = tf.train.Saver(tf.global_variables())
The weights are fed seperately as a place-holder because they are normalized in the get_batch function to create zero weights for the PAD inputs.
# Batch decoder inputs are re-indexed decoder_inputs, we create weights.
for length_idx in range(decoder_size):
batch_decoder_inputs.append(
np.array([decoder_inputs[batch_idx][length_idx]
for batch_idx in range(self.batch_size)], dtype=np.int32))
# Create target_weights to be 0 for targets that are padding.
batch_weight = np.ones(self.batch_size, dtype=np.float32)
for batch_idx in range(self.batch_size):
# We set weight to 0 if the corresponding target is a PAD symbol.
# The corresponding target is decoder_input shifted by 1 forward.
if length_idx < decoder_size - 1:
target = decoder_inputs[batch_idx][length_idx + 1]
if length_idx == decoder_size - 1 or target == data_utils.PAD_ID:
batch_weight[batch_idx] = 0.0
batch_weights.append(batch_weight)