Confusion Matrix with Tensorflow - tensorflow

I am using the finetune AlexNet code written by @kratzert on my own dataset, and it works properly (I got the code from here: https://github.com/kratzert/finetune_alexnet_with_tensorflow). I want to figure out how to build a confusion matrix from his code. I have tried to use tf.confusion_matrix(labels, predictions, num_classes), but I can't get it to work. I know in principle what the values for labels and predictions should be, but each time I feed these values I get an error. Can anyone help me with this, or have a look at the code (link above) and guide me?
I added these two lines to finetune.py, right after the accuracy calculation, so that the labels and the predictions are class indices:
with tf.name_scope("accuracy"):
    correct_pred = tf.equal(tf.argmax(score, 1), tf.argmax(y, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
    # the two added lines:
    true_class = tf.argmax(y, 1)
    predicted_class = tf.argmax(score, 1)
I then added tf.confusion_matrix() inside my session, at the very bottom, just before saving the checkpoint of the model:
for _ in range(val_batches_per_epoch):
    img_batch, label_batch = sess.run(next_batch)
    acc, cost = sess.run([accuracy, loss], feed_dict={x: img_batch,
                                                      y: label_batch,
                                                      keep_prob: 1.})
    test_acc += acc
    test_count += 1
test_acc /= test_count
print("{} Validation Accuracy = {:.4f} -- Validation Loss = {:.4f}".format(datetime.now(),test_acc, cost))
print("{} Saving checkpoint of model...".format(datetime.now()))
# the added line:
print(sess.run(tf.confusion_matrix(true_class, predicted_class, num_classes)))
# save checkpoint of the model
checkpoint_name = os.path.join(checkpoint_path,
                               'model_epoch'+str(epoch+1)+'.ckpt')
save_path = saver.save(sess, checkpoint_name)
print("{} Model checkpoint saved at {}".format(datetime.now(),
                                               checkpoint_name))
I have tried other places as well, but each time I get an error:
Caused by op 'Placeholder_1', defined at:
File "/home/armin/Desktop/Alexnet_DataPipeline/finetune.py", line 85, in <module>
y = tf.placeholder(tf.float32, [batch_size, num_classes])
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/array_ops.py", line 1777, in placeholder
return gen_array_ops.placeholder(dtype=dtype, shape=shape, name=name)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 4521, in placeholder
"Placeholder", dtype=dtype, shape=shape, name=name)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3290, in create_op
op_def=op_def)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1654, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InvalidArgumentError (see above for traceback): You must feed a value for placeholder tensor 'Placeholder_1' with dtype float and shape [128,3]
Any help will be appreciated. Thanks.

It's a fairly long piece of code you're referring to, and you did not specify where you put your confusion matrix line.
In my experience, the most frequent problem with confusion matrices is that tf.confusion_matrix() requires both the labels and the predictions as class indices, not as one-hot vectors. In other words, the label and the prediction should be in the form of the number 5 instead of [ 0, 0, 0, 0, 0, 1, 0, 0, 0, 0 ].
In the code you refer to, y is in the one-hot format. The output of the network, score, is a vector giving a score for each class. That is also not the required format. You need to do something like
true_class = tf.argmax( y, 1 )
predicted_class = tf.argmax( score, 1 )
and use those with the confusion matrix like
tf.confusion_matrix( true_class, predicted_class, num_classes )
(Basically, if you take a look at line 123 of finetune.py, that has both of those elements for determining accuracy, but they are not saved in separate tensors.)
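If you want to keep a running total of confusion matrices of all batches, you just have to add them up. Since each cell of the matrix counts the number of examples falling into that category, an element-wise addition creates the confusion matrix for the whole set. Note that true_class and predicted_class depend on the placeholders x and y, so the confusion matrix op has to be run with the same feed_dict as accuracy; running it without one raises the "You must feed a value for placeholder tensor" error.
# build the op once, outside the loop
confusion_op = tf.confusion_matrix(true_class, predicted_class, num_classes)
cm_running_total = None
for _ in range(val_batches_per_epoch):
    img_batch, label_batch = sess.run(next_batch)
    cm_numpy_array = sess.run(confusion_op, feed_dict={x: img_batch, y: label_batch, keep_prob: 1.})
    if cm_running_total is None:
        cm_running_total = cm_numpy_array
    else:
        cm_running_total += cm_numpy_array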
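To see the two ideas from this answer (argmax conversion and element-wise accumulation) without running TensorFlow, here is a minimal sketch in plain Python. The helper names argmax, confusion_matrix and add_matrices and the toy batches are made up for this illustration; they are not part of finetune.py or of the TensorFlow API. Rows are true classes and columns are predicted classes, which is the layout tf.confusion_matrix uses.

```python
def argmax(values):
    """Index of the largest entry (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def confusion_matrix(true_classes, predicted_classes, num_classes):
    """Count examples per (true class, predicted class) cell."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_class, predicted_class in zip(true_classes, predicted_classes):
        matrix[true_class][predicted_class] += 1
    return matrix


def add_matrices(a, b):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


num_classes = 3
# Toy data (hypothetical): each batch is (one-hot labels, per-class scores).
batches = [
    ([[1, 0, 0], [0, 1, 0]], [[2.0, 0.1, 0.3], [0.2, 0.1, 1.5]]),
    ([[0, 0, 1], [0, 1, 0]], [[0.1, 0.2, 0.9], [0.3, 2.2, 0.1]]),
]

running_total = None
for one_hot_labels, scores in batches:
    # Convert one-hot labels and score vectors to class indices first.
    true_classes = [argmax(row) for row in one_hot_labels]
    predicted_classes = [argmax(row) for row in scores]
    cm = confusion_matrix(true_classes, predicted_classes, num_classes)
    running_total = cm if running_total is None else add_matrices(running_total, cm)

for row in running_total:
    print(row)
```

The sum of all cells equals the number of examples seen, which is a quick sanity check when accumulating over validation batches.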

Related

Trouble with TensorFlow and MNIST recognition

Beforehand, I thank you for analyzing my post and helping out. I've recently gotten interested in ML with Tensorflow,
but I've encountered a problem with my code. I'm reading a book called Learning TensorFlow, and I've written out the whole thing
from the first example. They are analyzing MNIST images, and I've also added my own comments with my perspective on how things work
in the code. When I run the code, however, I get an error. Here's my code, and the error.
#Import tensorflow under the name of tf
import tensorflow as tf
#Import MNIST tutorial data from tensorflow
from tensorflow.examples.tutorials.mnist import input_data
#Declare constants
#Data path
DATA_DIR = 'C:/tmp/data'
#Number of steps
NUM_STEPS = 1000
#Number of examples per step
MINIBATCH_SIZE = 100
#When we read the data-set it saves it locally under our data path, or under c:/tmp/data
data = input_data.read_data_sets(DATA_DIR, one_hot = True)
#Our placeholder X is the image. Placeholders are supplied when running the computation graph
x = tf.placeholder(tf.float32, [None, 784])
#Create a variable representing the weights. Variables are manipulated by the computation graph
W = tf.Variable(tf.zeros([784, 10]))
y_true = tf.placeholder(tf.float32, [None, 784])
y_pred = tf.matmul(x, W)
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
logits=y_pred, labels=y_true))
gd_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)
correct_mask = tf.equal(tf.argmax(y_pred, 1), tf.argmax(y_true, 1))
accuracy = tf.reduce_mean(tf.cast(correct_mask, tf.float32))
with tf.Session() as sess:
    #Initialize global variables
    sess.run(tf.global_variables_initializer())
    for _ in range(NUM_STEPS):
        batch_xs, batch_ys = data.train.next_batch(MINIBATCH_SIZE)
        sess.run(gd_step, feed_dict={x: batch_xs, y_true: batch_ys})
    ans = sess.run(accuracy, feed_dict={x: data.test.images,
                                        y_true: data.test.labels})
print("Accuracy: {:.4}%".format(ans*100))
Now here's the error.
runfile('C:/Users/user/.spyder-py3/temp.py', wdir='C:/Users/user/.spyder-py3')
Extracting C:/tmp/data\train-images-idx3-ubyte.gz
Extracting C:/tmp/data\train-labels-idx1-ubyte.gz
Extracting C:/tmp/data\t10k-images-idx3-ubyte.gz
Extracting C:/tmp/data\t10k-labels-idx1-ubyte.gz
Traceback (most recent call last):
File "<ipython-input-11-bf503334b166>", line 1, in <module>
runfile('C:/Users/user/.spyder-py3/temp.py', wdir='C:/Users/CwWJc/.spyder-py3')
File "C:\Users\user\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 827, in runfile
execfile(filename, namespace)
File "C:\Users\user\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/user/.spyder-py3/temp.py", line 38, in <module>
sess.run(gd_step, feed_dict={x: batch_xs, y_true: batch_ys})
File "C:\Users\user\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 950, in run
run_metadata_ptr)
File "C:\Users\user\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1149, in _run
str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape (100, 10) for Tensor
'Placeholder_15:0', which has shape '(?, 784)'
Any help is greatly appreciated. Sorry if I'm making a stupid mistake. I find that I often do, though. Thanks in advance! Also, sorry for garbage formatting. :)
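Hahaha! I got y_true mixed up: its placeholder should have shape [None, 10] (one entry per class) to match the one-hot labels, not [None, 784]. Sorry for the hassle, everyone.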
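The ValueError in the traceback comes from comparing the shape of the fed value with the declared placeholder shape, where an unknown dimension (shown as ? or None) accepts any size. A rough sketch of that rule in plain Python follows; shapes_compatible is a hypothetical helper written for this illustration, not a TensorFlow function, and TensorFlow's real check covers more cases.

```python
def shapes_compatible(value_shape, placeholder_shape):
    """Return True if a value of value_shape could be fed to the placeholder.

    A None entry in placeholder_shape stands for an unknown dimension
    and matches any size.
    """
    if len(value_shape) != len(placeholder_shape):
        return False
    return all(
        expected is None or expected == actual
        for actual, expected in zip(value_shape, placeholder_shape)
    )


label_batch_shape = (100, 10)  # 100 one-hot labels over 10 digit classes

# The placeholder as declared in the question rejects the label batch.
print(shapes_compatible(label_batch_shape, (None, 784)))
# A placeholder with one column per class accepts it.
print(shapes_compatible(label_batch_shape, (None, 10)))
```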

Tensorflow won't matmul inputs and weights. "Dimensions must be equal"

I've been working on a simple TensorFlow neural network. My input placeholder is
x = tf.placeholder(tf.float32, shape=[None, 52000, 3]).
My weight matrix is initialized to all zeros as
W = tf.Variable(tf.zeros([52000, 10])).
I tried different combinations with and without the 3 for color channels, but I guess I'm just not understanding the dimensionality because I got the error:
Traceback (most recent call last): File
"C:\Users\Everybody\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\framework\common_shapes.py",
line 686, in _call_cpp_shape_fn_impl
input_tensors_as_shapes, status) File "C:\Users\Everybody\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\framework\errors_impl.py",
line 473, in exit
c_api.TF_GetCode(self.status.status)) tensorflow.python.framework.errors_impl.InvalidArgumentError: Shape
must be rank 2 but is rank 3 for 'MatMul' (op: 'MatMul') with input
shapes: [?,52000,3], [52000,10].
During handling of the above exception, another exception occurred:
Traceback (most recent call last): File "rating.py", line 65, in
y = tf.matmul(x, W) + b # "fake" outputs to train/test File "C:\Users\Everybody\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\ops\math_ops.py",
line 1891, in matmul
a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name) File
"C:\Users\Everybody\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\ops\gen_math_ops.py",
line 2436, in _mat_mul
name=name) File "C:\Users\Everybody\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\framework\op_def_library.py",
line 787, in _apply_op_helper
op_def=op_def) File "C:\Users\Everybody\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\framework\ops.py",
line 2958, in create_op
set_shapes_for_outputs(ret) File "C:\Users\Everybody\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\framework\ops.py",
line 2209, in set_shapes_for_outputs
shapes = shape_func(op) File "C:\Users\Everybody\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\framework\ops.py",
line 2159, in call_with_requiring
return call_cpp_shape_fn(op, require_shape_fn=True) File "C:\Users\Everybody\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\framework\common_shapes.py",
line 627, in call_cpp_shape_fn
require_shape_fn) File "C:\Users\Everybody\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\framework\common_shapes.py",
line 691, in _call_cpp_shape_fn_impl
raise ValueError(err.message) ValueError: Shape must be rank 2 but is rank 3 for 'MatMul' (op: 'MatMul') with input shapes: [?,52000,3],
[52000,10].
At first, I thought my next_batch() function was the culprit, because I had to write my own: I loaded my images "manually" using scipy.misc.imread(). Its definition reads:
q = 0
def next_batch(batch_size):
    x = images[q:q + batch_size]
    y = one_hots[q:q + batch_size]
    q = (q + batch_size) % len(images)
    return x, y
However, after looking through, I don't see what's wrong with this, so I imagine that I'm just confused about dimensionality. It is supposed to be a "flattened" 200x260 color image. It just occurred to me now that maybe I have to flatten the color channels as well? I will place my full code below if curious. I'm a bit new to Tensorflow, so thanks, all. (Yes, it is not a CNN yet, I decided to start simple just to make sure I'm importing my dataset right. And, I know it is tiny, I'm starting my dataset small too.)
############# IMPORT DEPENDENCIES ####################################
import tensorflow as tf
sess = tf.InteractiveSession() #start session
import scipy.misc
import numpy as np
######################################################################
#SET UP DATA #########################################################
images = []
one_hots = []
########### IMAGES ##################################################
#put all the images in a list
for i in range(60):
    images.append(scipy.misc.imread('./shoes/%s.jpg' % str(i+1)))
    print("One image appended...\n")
#normalize them, "divide" by 255
for image in images:
    print("One image normalized...\n")
    for i in range(260):
        for j in range(200):
            for c in range(3):
                image[i][j][c]/=255
for image in images:
    tf.reshape(image, [52000, 3])
########################################################################
################# ONE-HOT VECTORS ######################################
f = open('rateVectors.txt')
lines = f.readlines()
for i in range(0, 600, 10):
    fillerlist = []
    for j in range(10):
        fillerlist.append(float(lines[i+j][:-1]))
    one_hots.append(fillerlist)
    print("One one-hot vector added...\n")
########################################################################3
#set placeholders and such for input, output, weights, biases
x = tf.placeholder(tf.float32, shape=[None, 52000, 3])
y_ = tf.placeholder(tf.float32, shape=[None, 10])
W = tf.Variable(tf.zeros([52000, 10])) # These are our weights and biases
b = tf.Variable(tf.zeros([10])) # initialized as zeroes.
#########################################################################
sess.run(tf.global_variables_initializer()) #initialize variables in the session
y = tf.matmul(x, W) + b # "fake" outputs to train/test
##################### DEFINING OUR MODEL ####################################
#our loss function
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(y, y_))
#defining our training as gradient descent
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)
###################### TRAINING #############################################
#################### OUR CUSTOM BATCH FUNCTION ##############################
q = 0
def next_batch(batch_size):
    x = images[q:q + batch_size]
    y = one_hots[q:q + batch_size]
    q = (q + batch_size) % len(images)
    return x, y
#train
for i in range(6):
    batch = next_batch(10)
    train_step.run(feed_dict={x: batch[0], y_: batch[1]})
    print("Batch Number: " + i + "\n")
print("Done training...\n")
################ RESULTS #################################################
#calculating accuracy
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
#print accuracy
print(accuracy.eval(feed_dict={x: images, y_: one_hots}))
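For a convolutional layer, your placeholder should have the dimension [None, 200, 260, 3], where None is the batch size, 200 and 260 are the image size, and 3 is the number of channels.
Your weights should be [filter_height, filter_width, num_channels, num_filters].
Your bias should be [num_filters].
And the dimensions for the labels should be [None, num_classes], where None is the batch size and num_classes is the number of classes that your images have.
These are just to make sure that the math works. If you keep the plain tf.matmul(x, W) model from the question instead, both operands must be rank 2: flatten each image, color channels included, into a single row of 52000 * 3 = 156000 values, so that x is [None, 156000] and W is [156000, 10].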
I took these codes from here
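For the plain matmul model in the question, the error message asks for a rank-2 input, which means each image has to become one flat row before it is fed. The sketch below shows that flattening, plus a batch helper that keeps its position on an object; the q counter in the question is assigned inside next_batch without a global declaration, which raises UnboundLocalError as soon as the function is called. It uses plain Python lists and tiny made-up images so that it runs on its own; flatten_image and BatchFeeder are hypothetical names, and with real data numpy's reshape would do the flattening.

```python
def flatten_image(image):
    """Flatten a nested height x width x channels list into one flat row."""
    return [value for row in image for pixel in row for value in pixel]


class BatchFeeder:
    """Hands out consecutive batches and wraps around at the end."""

    def __init__(self, images, labels):
        self.images = images
        self.labels = labels
        self.position = 0  # replaces the module-level counter q

    def next_batch(self, batch_size):
        start = self.position
        end = start + batch_size
        self.position = end % len(self.images)
        return self.images[start:end], self.labels[start:end]


# Toy data (hypothetical): four 2x2 RGB "images", image i is filled with i.
images = [[[[i, i, i], [i, i, i]], [[i, i, i], [i, i, i]]] for i in range(4)]
labels = [[1, 0], [0, 1], [1, 0], [0, 1]]

flat_images = [flatten_image(image) for image in images]
print(len(flat_images[0]))  # 2 * 2 * 3 values per image

feeder = BatchFeeder(flat_images, labels)
xs, ys = feeder.next_batch(2)
print(len(xs), len(xs[0]), len(ys))
```

With 200x260 color images the same flattening gives rows of 156000 values, so the placeholder and the first dimension of W would need that size.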

Custom loss function: perform a model.predict on the data in y_pred

I am training a network to denoise images, using the CIFAR10 dataset. I am trying to write a custom loss function so that the loss is mse / classification_accuracy.
Given that my network receives 32x32 (noisy) images as input and predicts 32x32 (denoised) images, I am assuming that y_pred and y_true would be arrays of 32x32 images. Thus my custom loss function looks like this:
def custom_loss():
    def joint_optimized_loss(y_true, y_pred):
        mse = K.mean(K.square(y_pred - y_true), axis=-1)
        preds = classif_model.predict(y_pred)
        correctPreds = 0
        totPreds = 0
        for pred in preds:
            predictedClass = pred.index(max(pred))
            totPreds += 1
            if predictedClass == currentClass:
                correctPreds += 1
        classifAccuracy = correctPreds / totPreds
        loss = mse / classifAccuracy
        return loss
    return joint_optimized_loss
myModel.compile(optimizer='adadelta', loss=custom_loss())
classif_model is a pre-trained model that classifies CIFAR10 images into one of the 10 classes. It receives an array of 32x32 images.
However when I run my code I get the following error:
Traceback (most recent call last):
File "myCode.py", line 94, in <module>
myModel.compile(optimizer='adadelta', loss=custom_loss())
File "/home/rvidalma/anaconda2/envs/tensorUpdated/lib/python2.7/site-packages/keras/engine/training.py",
line 850, in compile
sample_weight, mask)
File "/home/rvidalma/anaconda2/envs/tensorUpdated/lib/python2.7/site-packages/keras/engine/training.py",
line 450, in weighted
score_array = fn(y_true, y_pred)
File "myCode.py", line 57, in joint_optimized_loss
preds = classif_model.predict(y_pred)
File "/home/rvidalma/anaconda2/envs/tensorUpdated/lib/python2.7/site-packages/keras/models.py",
line 913, in predict
return self.model.predict(x, batch_size=batch_size, verbose=verbose)
File "/home/rvidalma/anaconda2/envs/tensorUpdated/lib/python2.7/site-packages/keras/engine/training.py",
line 1713, in predict
verbose=verbose, steps=steps)
File "/home/rvidalma/anaconda2/envs/tensorUpdated/lib/python2.7/site-packages/keras/engine/training.py",
line 1260, in _predict_loop
batches = _make_batches(num_samples, batch_size)
File "/home/rvidalma/anaconda2/envs/tensorUpdated/lib/python2.7/site-packages/keras/engine/training.py",
line 374, in _make_batches
num_batches = int(np.ceil(size / float(batch_size)))
AttributeError: 'Dimension' object has no attribute 'ceil'
I think this has something to do with the fact that y_true and y_pred are both tensors that, before training, are empty thus classif_model.predict fails as it is expecting an array. However I am not sure on how to fix this...
I tried getting instead the value of y_pred using K.get_value(y_pred), but that gives me the following error:
tensorflow.python.framework.errors_impl.InvalidArgumentError: Shape [-1,32,32,3] has negative dimensions
[[Node: input_1 = Placeholder[dtype=DT_FLOAT, shape=[?,32,32,3], _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
You cannot use accuracy as a loss function, as it is not differentiable: it is piecewise constant in the network outputs, so its gradient is zero almost everywhere. This is why differentiable surrogates such as the cross-entropy are used instead.
Additionally, the way you implemented accuracy is non-symbolic. You should use only functions from keras.backend to implement a loss for it to work properly.
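A small numeric illustration of the first point, in plain Python rather than Keras: nudging a logit leaves the accuracy unchanged while the cross-entropy moves, so only the latter gives an optimizer a direction to follow. The function names and the toy logits are assumptions made for this sketch.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, true_class):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[true_class])


def accuracy(logits, true_class):
    """1.0 if the arg-max class is the true class, else 0.0."""
    predicted = max(range(len(logits)), key=lambda i: logits[i])
    return 1.0 if predicted == true_class else 0.0


true_class = 0
logits = [2.0, 1.0, 0.5]
nudged = [2.1, 1.0, 0.5]  # slightly more confident in the true class

acc_change = accuracy(nudged, true_class) - accuracy(logits, true_class)
ce_change = cross_entropy(nudged, true_class) - cross_entropy(logits, true_class)

print("change in accuracy:", acc_change)
print("change in cross-entropy:", ce_change)
```

The accuracy only changes when the arg-max flips, which is why dividing the MSE by it, as in the question, does not give a useful training signal for the classification part.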
I had almost the same problem, and this worked for me.
Instead of:
preds = classif_model.predict(y_pred)
try:
preds = classif_model(y_pred)
I am not sure of the exact reason, but I think it is because model.predict(y) needs a batch size, and at compile time there isn't one, so model.predict(y) cannot be used there.
Please correct me if this is wrong.

How to backpropagate with complex valued weights

We are currently trying to replicate the results of the following paper: https://openreview.net/forum?id=H1S8UE-Rb
To do so, we need to run backpropagation on a neural network which contains complex valued weights.
When we try to do so (with code [0]), we get an error (at [1]). We cannot find the source code for any project that trains a neural network containing complex valued weights.
We were wondering if we would need to implement the paper's backpropagation adjustments ourselves or if this is already part of some neural network libraries. If it needs to be implemented in Tensorflow, what would be the proper steps to achieve that?
[0]:
def define_neuron(x):
    """
    x is input tensor
    """
    x = tf.cast(x, tf.complex64)
    mnist_x = mnist_y = 28
    n = mnist_x * mnist_y
    c = 10
    m = 10 # m needs to be calculated
    with tf.name_scope("linear_combination"):
        complex_weight = weight_complex_variable([n,m])
        complex_bias = bias_complex_variable([m])
        h_1 = x @ complex_weight + complex_bias
    return h_1
def main(_):
    mnist = input_data.read_data_sets(
        FLAGS.data_dir,
        one_hot=True,
    )
    # `None` for the first dimension in this shape means that it is variable.
    x_shape = [None, 784]
    x = tf.placeholder(tf.float32, x_shape)
    y_ = tf.placeholder(tf.float32, [None, 10])
    yz = h_1 = define_neuron(x)
    y = tf.nn.softmax(tf.abs(yz))
    with tf.name_scope('loss'):
        cross_entropy = tf.nn.softmax_cross_entropy_with_logits(
            labels=y_,
            logits=y,
        )
    cross_entropy = tf.reduce_mean(cross_entropy)
    with tf.name_scope('adam_optimizer'):
        optimizer = tf.train.AdamOptimizer(1e-4)
        optimizer = tf.train.GradientDescentOptimizer(1e-4)
        train_step = optimizer.minimize(cross_entropy)
[1]:
Extracting /tmp/tensorflow/mnist/input_data/train-images-idx3-ubyte.gz
Extracting /tmp/tensorflow/mnist/input_data/train-labels-idx1-ubyte.gz
Extracting /tmp/tensorflow/mnist/input_data/t10k-images-idx3-ubyte.gz
Extracting /tmp/tensorflow/mnist/input_data/t10k-labels-idx1-ubyte.gz
Traceback (most recent call last):
File "complex.py", line 156, in <module>
tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
File "/Users/kevin/wdev/learn_tensor/env/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "complex.py", line 58, in main
train_step = optimizer.minimize(cross_entropy)
File "/Users/kevin/wdev/learn_tensor/env/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 343, in minimize
grad_loss=grad_loss)
File "/Users/kevin/wdev/learn_tensor/env/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 419, in compute_gradients
[v for g, v in grads_and_vars
File "/Users/kevin/wdev/learn_tensor/env/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 547, in _assert_valid_dtypes
dtype, t.name, [v for v in valid_dtypes]))
ValueError: Invalid type tf.complex64 for linear_combination/Variable:0, expected: [tf.float32, tf.float64, tf.float16].
I have also tried to implement a similar network in TensorFlow and saw that the optimizer cannot backpropagate through complex-valued variables. The workaround is to keep separate real tensors for the real and imaginary parts. You will have to write a function that computes the amplitude of the "complex" output of the network, which is simply sqrt(Re^2 + Im^2) (or Re^2 + Im^2 if the squared amplitude is enough). This output value is what you will use to compute the loss.
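The workaround of carrying the real and imaginary parts separately can be sketched as follows. This is plain Python on nested lists so that it runs on its own; in TensorFlow the same pattern would presumably use real-valued variables and real matrix products. The function names and the toy numbers are assumptions for illustration only. It relies on the identity (a + ib)(c + id) = (ac - bd) + i(ad + bc).

```python
import math


def matmul(a, b):
    """Real-valued matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def elementwise(op, a, b):
    """Apply a binary function cell by cell to two equally sized matrices."""
    return [[op(x, y) for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


def complex_matmul(x_re, x_im, w_re, w_im):
    """Complex matrix product computed from four real matrix products."""
    out_re = elementwise(lambda p, q: p - q, matmul(x_re, w_re), matmul(x_im, w_im))
    out_im = elementwise(lambda p, q: p + q, matmul(x_re, w_im), matmul(x_im, w_re))
    return out_re, out_im


def amplitude(out_re, out_im):
    """Cell-wise sqrt(Re^2 + Im^2), a real value suitable for a loss."""
    return elementwise(math.hypot, out_re, out_im)


# Toy input x = [1+2j, 3-1j] and weights w = [2-1j, 0.5+4j] (hypothetical).
x_re, x_im = [[1.0, 3.0]], [[2.0, -1.0]]
w_re, w_im = [[2.0], [0.5]], [[-1.0], [4.0]]

out_re, out_im = complex_matmul(x_re, x_im, w_re, w_im)
amp = amplitude(out_re, out_im)

# Cross-check against Python's built-in complex arithmetic.
expected = (1 + 2j) * (2 - 1j) + (3 - 1j) * (0.5 + 4j)
print(out_re[0][0], out_im[0][0], expected)
print(amp[0][0], abs(expected))
```

Because every variable involved is real, a standard optimizer can update w_re and w_im without hitting the dtype check shown in the traceback.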
Using the optimizer won't work; it is a reported issue, and I don't think TF 2 supports it yet. You can, however, apply the update by hand, for example:
[...]
gradients = tf.gradients(mse, [weights])[0]
training_op = tf.assign(weights, weights - learning_rate * gradients)
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    sess.run(training_op)
tf.gradients behaves as expected here and computes the gradient correctly. Here is the discussion on what the gradient computes for complex variables.

Running distributed Tensorflow with InvalidArgumentError: You must feed a value for placeholder tensor 'Placeholder' with dtype float

I have implemented a variational autoencoder with TensorFlow on a single machine. Now I am trying to run it on my cluster with the distributed mechanism provided by TensorFlow, but the following problem has had me stuck for several days.
Traceback (most recent call last):
File "/home/yama/mfs/ZhuSuan/examples/vae.py", line 265, in <module>
print('>> Test log likelihood = {}'.format(np.mean(test_lls)))
File "/usr/lib/python2.7/contextlib.py", line 35, in __exit__
self.gen.throw(type, value, traceback)
File "/mfs/yama/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 942, in managed_session
self.stop(close_summary_writer=close_summary_writer)
File "/mfs/yama/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 768, in stop
stop_grace_period_secs=self._stop_grace_secs)
File "/mfs/yama/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 322, in join
six.reraise(*self._exc_info_to_raise)
File "/mfs/yama/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 267, in stop_on_exception
yield
File "/mfs/yama/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 411, in run
self.run_loop()
File "/mfs/yama/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 972, in run_loop
self._sv.global_step])
File "/mfs/yama/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 372, in run
run_metadata_ptr)
File "/mfs/yama/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 636, in _run
feed_dict_string, options, run_metadata)
File "/mfs/yama/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 708, in _do_run
target_list, options, run_metadata)
File "/mfs/yama/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 728, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.InvalidArgumentError: You must feed a value for placeholder tensor 'Placeholder' with dtype float
[[Node: Placeholder = Placeholder[dtype=DT_FLOAT, shape=[], _device="/job:worker/replica:0/task:0/gpu:0"]()]]
[[Node: model_1/fully_connected_10/Relu_G88 = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:0/cpu:0", send_device="/job:worker/replica:0/task:0/gpu:0", send_device_incarnation=3964479821165574552, tensor_name="edge_694_model_1/fully_connected_10/Relu", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:0/cpu:0"]()]]
Caused by op u'Placeholder', defined at:
File "/home/yama/mfs/ZhuSuan/examples/vae.py", line 201, in <module>
x = tf.placeholder(tf.float32, shape=(None, x_train.shape[1]))
File "/mfs/yama/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/array_ops.py", line 895, in placeholder
name=name)
File "/mfs/yama/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 1238, in _placeholder
name=name)
File "/mfs/yama/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/op_def_library.py", line 704, in apply_op
op_def=op_def)
File "/mfs/yama/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2260, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/mfs/yama/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1230, in __init__
self._traceback = _extract_stack()
Here is my code. I have pasted only the main function for simplicity:
if __name__ == "__main__":
tf.set_random_seed(1234)
# Load MNIST
data_path = os.path.join(os.path.dirname(os.path.abspath(__file__)),
'data', 'mnist.pkl.gz')
x_train, t_train, x_valid, t_valid, x_test, t_test = \
dataset.load_mnist_realval(data_path)
x_train = np.vstack([x_train, x_valid])
np.random.seed(1234)
x_test = np.random.binomial(1, x_test, size=x_test.shape).astype('float32')
# Define hyper-parametere
n_z = 40
# Define training/evaluation parameters
lb_samples = 1
ll_samples = 5000
epoches = 10
batch_size = 100
test_batch_size = 100
iters = x_train.shape[0] // batch_size
test_iters = x_test.shape[0] // test_batch_size
test_freq = 10
ps_hosts = FLAGS.ps_hosts.split(",")
worker_hosts = FLAGS.worker_hosts.split(",")
# Create a cluster from the parameter server and worker hosts.
clusterSpec = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})
print("Create and start a server for the local task.")
# Create and start a server for the local task.
server = tf.train.Server(clusterSpec,
job_name=FLAGS.job_name,
task_index=FLAGS.task_index)
print("Start ps and worker server")
if FLAGS.job_name == "ps":
server.join()
elif FLAGS.job_name == "worker":
#set distributed device
with tf.device(tf.train.replica_device_setter(
worker_device="/job:worker/task:%d" % FLAGS.task_index,
cluster=clusterSpec)):
print("Build the training computation graph")
# Build the training computation graph
x = tf.placeholder(tf.float32, shape=(None, x_train.shape[1]))
optimizer = tf.train.AdamOptimizer(learning_rate=0.001, epsilon=1e-4)
with tf.variable_scope("model") as scope:
with pt.defaults_scope(phase=pt.Phase.train):
train_model = M1(n_z, x_train.shape[1])
train_vz_mean, train_vz_logstd = q_net(x, n_z)
train_variational = ReparameterizedNormal(
train_vz_mean, train_vz_logstd)
grads, lower_bound = advi(
train_model, x, train_variational, lb_samples, optimizer)
infer = optimizer.apply_gradients(grads)
print("Build the evaluation computation graph")
# Build the evaluation computation graph
with tf.variable_scope("model", reuse=True) as scope:
with pt.defaults_scope(phase=pt.Phase.test):
eval_model = M1(n_z, x_train.shape[1])
eval_vz_mean, eval_vz_logstd = q_net(x, n_z)
eval_variational = ReparameterizedNormal(
eval_vz_mean, eval_vz_logstd)
eval_lower_bound = is_loglikelihood(
eval_model, x, eval_variational, lb_samples)
eval_log_likelihood = is_loglikelihood(
eval_model, x, eval_variational, ll_samples)
global_step = tf.Variable(0)
saver = tf.train.Saver()
summary_op = tf.merge_all_summaries()
init_op = tf.initialize_all_variables()
# Create a "supervisor", which oversees the training process.
sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0),
logdir=LogDir,
init_op=init_op,
summary_op=summary_op,
saver=saver,
global_step=global_step,
save_model_secs=600)
# Run the inference
with sv.managed_session(server.target) as sess:
epoch = 0
while not sv.should_stop() and epoch < epoches:
#for epoch in range(1, epoches + 1):
np.random.shuffle(x_train)
lbs = []
for t in range(iters):
x_batch = x_train[t * batch_size:(t + 1) * batch_size]
x_batch = np.random.binomial( n=1, p=x_batch, size=x_batch.shape).astype('float32')
_, lb = sess.run([infer, lower_bound], feed_dict={x: x_batch})
lbs.append(lb)
if epoch % test_freq == 0:
test_lbs = []
test_lls = []
for t in range(test_iters):
test_x_batch = x_test[
t * test_batch_size: (t + 1) * test_batch_size]
test_lb, test_ll = sess.run(
[eval_lower_bound, eval_log_likelihood],
feed_dict={x: test_x_batch}
)
test_lbs.append(test_lb)
test_lls.append(test_ll)
print('>> Test lower bound = {}'.format(np.mean(test_lbs)))
print('>> Test log likelihood = {}'.format(np.mean(test_lls)))
sv.stop()
I have been trying to correct my code for several days, but all my efforts have failed. Looking forward to your help!
The most likely cause of this exception is that one of the operations that the tf.train.Supervisor runs in the background depends on the tf.placeholder() tensor x, but doesn't have enough information to feed a value for it.
The most likely culprit is summary_op = tf.merge_all_summaries(), because library code often summarizes values that depend on the training data. To prevent the supervisor from collecting summaries in the background, pass summary_op=None to the tf.train.Supervisor constructor:
# Create a "supervisor", which oversees the training process.
sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0),
                         logdir=LogDir,
                         init_op=init_op,
                         summary_op=None,
                         saver=saver,
                         global_step=global_step,
                         save_model_secs=600)
After doing this, you will need to make alternative arrangements to collect summaries. The easiest way to do this is to pass summary_op to sess.run() periodically, then pass the result to sv.summary_computed().
I came across a similar thing: the chief was going down with the aforementioned error message. However, since I was using MonitoredTrainingSession rather than a self-made Supervisor, I was able to solve the problem by disabling the default summaries. To do so, you have to provide
save_summaries_secs=None,
save_summaries_steps=None,
to the constructor of MonitoredTrainingSession. Afterwards, everything went smoothly!
Code on Github
I had exactly the same problem. Following mrry's suggestion, I was able to work this out by:
1. Disabling summary logging in the supervisor by setting summary_op=None (as mrry suggested).
2. Creating my own summary_op and passing it to sess.run() along with the rest of the ops to be evaluated. Hold on to the resulting summary; let's say it's called 'my_summary'.
3. Creating my own summary writer and calling it with 'my_summary', e.g.: summary_writer.add_summary(my_summary, epoch_count)
To clarify, I did not use mrry's suggestion to call sess.run(summary_op) and sv.summary_computed(), but instead ran the summary_op along with the other operations and then wrote out the summary myself. You might also want to make the summary writing conditional on being the chief.
So basically, you need to bypass the Supervisor's summary writing services completely. This seems like a surprising limitation/bug of Supervisor, since it isn't exactly uncommon to want to log things that depend on the input (which lives in a placeholder). For example, in my network (an autoencoder) the cost depends on the input.