Applying gradients with feed_dict for gradients - tensorflow

I want to do some non-tensorflow processing on the computed gradients, before applying them on the variables.
My plan was to run the gradient ops returned by compute_gradients, do my processing (in Python, without TensorFlow), and then run the apply operation returned by apply_gradients, feeding the processed gradients through feed_dict. Unfortunately, this doesn't work in my scenario.
I managed to narrow it down to some issue with tf.nn.embedding_lookup (same happens with tf.gather), and the error can be reproduced as follows (using tf1.4):
import tensorflow as tf
x = tf.placeholder(dtype=tf.float32, shape=[])
z = tf.placeholder(dtype=tf.int32, shape=[])
emb_mat = tf.get_variable('w', [100, 5], initializer=tf.truncated_normal_initializer(stddev=0.1))
emb = tf.nn.embedding_lookup(emb_mat, z)
loss = x - tf.reduce_sum(emb) # Just some silly loss
opt = tf.train.GradientDescentOptimizer(0.1)
grads_and_vars = opt.compute_gradients(loss, tf.trainable_variables())
train_op = opt.apply_gradients(grads_and_vars)
grads = [g for g,v in grads_and_vars]
tsess = tf.Session()
tsess.run(tf.global_variables_initializer())
gradsres = tsess.run(grads, {x: 1.0, z: 1})
tsess.run(train_op, {g:r for g,r in zip(grads, gradsres)})
which results in the error
Traceback (most recent call last):
File "/home/cruvadom/.p2/pool/plugins/org.python.pydev_6.0.0.201709191431/pysrc/_pydevd_bundle/pydevd_exec.py", line 3, in Exec
exec exp in global_vars, local_vars
File "<console>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 889, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1098, in _run
raise ValueError('Tensor %s may not be fed.' % subfeed_t)
ValueError: Tensor Tensor("gradients/Gather_1_grad/ToInt32:0", shape=(2,), dtype=int32, device=/device:GPU:0) may not be fed.
It seems there is some additional tensor I need to feed to the graph for the computation. What is the right way to do what I want?
Thanks!

If you run the training operation, it will automatically calculate the gradients. You can retrieve the gradients from the session:
tsess = tf.Session()
tsess.run(tf.global_variables_initializer())
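# Fetch into new names so the graph tensors `grads` and `loss` are not overwritten
_, grad_vals, loss_val = tsess.run([train_op, grads, loss], {x: 1.0, z: 1})
assert not np.isnan(loss_val), 'Something wrong! loss is nan...'  # needs `import numpy as np`
# Get the gradients: summaries are graph ops, so build them from the symbolic (gradient, variable) pairs
for g, v in grads_and_vars:
    if g is not None:
        grad_hist_summary = tf.summary.histogram("{}/grad_histogram".format(v.name), g)
        sparsity_summary = tf.summary.scalar("{}/grad/sparsity".format(v.name), tf.nn.zero_fraction(g))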
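The tensor named in the error, gradients/Gather_1_grad/ToInt32:0 with shape (2,), hints at the likely cause: the gradient of tf.nn.embedding_lookup (and tf.gather) is normally a sparse tf.IndexedSlices object made of several tensors (values, indices, dense shape) rather than one dense tensor, so using it as a feed_dict key touches internal tensors that may not be fed. One possible workaround is to turn such gradients into dense arrays, process them, and feed the results through ordinary placeholders that are passed to apply_gradients. The NumPy sketch below only illustrates the densify, process, update steps. The helper name densify and the clipping step are assumptions made for illustration, not part of the original code.

```python
import numpy as np

def densify(indices, values, dense_shape):
    """Scatter-add sparse gradient rows into a dense array (hypothetical helper)."""
    dense = np.zeros(dense_shape, dtype=values.dtype)
    np.add.at(dense, indices, values)  # repeated indices accumulate
    return dense

# Gradient of loss = x - sum(emb_mat[1]) w.r.t. a [100, 5] matrix: only row 1 is touched.
indices = np.array([1])
values = -np.ones((1, 5), dtype=np.float32)
dense_grad = densify(indices, values, (100, 5))

# Stand-in for the non-TensorFlow processing step (assumption): clip to [-0.5, 0.5].
processed = np.clip(dense_grad, -0.5, 0.5)

# Plain gradient-descent update with learning rate 0.1, as in the question.
emb_mat = np.zeros((100, 5), dtype=np.float32)
emb_mat -= 0.1 * processed
```

In the TensorFlow graph, the dense, processed array would then be fed to a placeholder with the variable's shape instead of to the gradient tensor itself.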

How to backpropagate with complex valued weights

We are currently trying to replicate the results of the following paper: https://openreview.net/forum?id=H1S8UE-Rb
To do so, we need to run backpropagation on a neural network which contains complex valued weights.
When we try to do so (with code [0]), we get an error (at [1]). We cannot find the source code for any project that trains a neural network containing complex valued weights.
We were wondering if we would need to implement the paper's backpropagation adjustments ourselves or if this is already part of some neural network libraries. If it needs to be implemented in Tensorflow, what would be the proper steps to achieve that?
[0]:
def define_neuron(x):
    """
    x is input tensor
    """
    x = tf.cast(x, tf.complex64)
    mnist_x = mnist_y = 28
    n = mnist_x * mnist_y
    c = 10
    m = 10 # m needs to be calculated
    with tf.name_scope("linear_combination"):
        complex_weight = weight_complex_variable([n,m])
        complex_bias = bias_complex_variable([m])
        h_1 = x # complex_weight + complex_bias
    return h_1
def main(_):
    mnist = input_data.read_data_sets(
        FLAGS.data_dir,
        one_hot=True,
    )
    # `None` for the first dimension in this shape means that it is variable.
    x_shape = [None, 784]
    x = tf.placeholder(tf.float32, x_shape)
    y_ = tf.placeholder(tf.float32, [None, 10])
    yz = h_1 = define_neuron(x)
    y = tf.nn.softmax(tf.abs(yz))
    with tf.name_scope('loss'):
        cross_entropy = tf.nn.softmax_cross_entropy_with_logits(
            labels=y_,
            logits=y,
        )
        cross_entropy = tf.reduce_mean(cross_entropy)
    with tf.name_scope('adam_optimizer'):
        optimizer = tf.train.AdamOptimizer(1e-4)
        optimizer = tf.train.GradientDescentOptimizer(1e-4)
        train_step = optimizer.minimize(cross_entropy)
[1]:
Extracting /tmp/tensorflow/mnist/input_data/train-images-idx3-ubyte.gz
Extracting /tmp/tensorflow/mnist/input_data/train-labels-idx1-ubyte.gz
Extracting /tmp/tensorflow/mnist/input_data/t10k-images-idx3-ubyte.gz
Extracting /tmp/tensorflow/mnist/input_data/t10k-labels-idx1-ubyte.gz
Traceback (most recent call last):
File "complex.py", line 156, in <module>
tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
File "/Users/kevin/wdev/learn_tensor/env/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "complex.py", line 58, in main
train_step = optimizer.minimize(cross_entropy)
File "/Users/kevin/wdev/learn_tensor/env/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 343, in minimize
grad_loss=grad_loss)
File "/Users/kevin/wdev/learn_tensor/env/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 419, in compute_gradients
[v for g, v in grads_and_vars
File "/Users/kevin/wdev/learn_tensor/env/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 547, in _assert_valid_dtypes
dtype, t.name, [v for v in valid_dtypes]))
ValueError: Invalid type tf.complex64 for linear_combination/Variable:0, expected: [tf.float32, tf.float64, tf.float16].
I have also tried to implement a similar network in TensorFlow and saw that the optimizer cannot do backpropagation on complex-valued tensors. The workaround is to keep separate real tensors for the real and imaginary parts. You will have to write a function that computes the squared amplitude of the network's "complex" output, which is simply Re^2 + Im^2 (take the square root if you need the amplitude itself). This value is what you use to compute the loss.
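As a minimal sketch of that workaround, assuming a single linear layer, the complex product can be expanded into four real matrix products. The snippet uses NumPy to keep it self-contained; the function names complex_linear and squared_amplitude are hypothetical, and the same arithmetic would carry over to real-valued TensorFlow tensors.

```python
import numpy as np

def complex_linear(x_re, x_im, w_re, w_im, b_re, b_im):
    """Compute (x_re + i*x_im) @ (w_re + i*w_im) + (b_re + i*b_im) with real arrays only."""
    # (a + ib)(c + id) = (ac - bd) + i(ad + bc)
    out_re = x_re @ w_re - x_im @ w_im + b_re
    out_im = x_re @ w_im + x_im @ w_re + b_im
    return out_re, out_im

def squared_amplitude(re, im):
    """|z|^2 = Re^2 + Im^2, a real value that can go into a loss."""
    return re ** 2 + im ** 2

rng = np.random.RandomState(0)
x_re, x_im = rng.normal(size=(2, 4)), np.zeros((2, 4))  # real input, zero imaginary part
w_re, w_im = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))
b_re, b_im = rng.normal(size=3), rng.normal(size=3)

out_re, out_im = complex_linear(x_re, x_im, w_re, w_im, b_re, b_im)
amp2 = squared_amplitude(out_re, out_im)
```

Because every array involved is real, a standard optimizer only ever sees float variables.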
Using the optimizer won't work; it is a reported issue, and I don't think TF 2 supports it yet. You can, however, do the update by hand, for example:
[...]
gradients = tf.gradients(mse, [weights])[0]
training_op = tf.assign(weights, weights - learning_rate * gradients)
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    sess.run(training_op)
tf.gradients here behaves as expected and computes the gradient as it should. Here is the discussion on what the gradient computes for complex variables.
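To make the hand-made update concrete, here is a small pure-Python sketch for a single complex weight and the real-valued loss |w*x - t|^2. The gradient is written as dL/d(Re w) + i * dL/d(Im w), which is one common convention and is chosen here as an assumption; the sketch makes no claim about the exact convention tf.gradients uses. All names and values are illustrative.

```python
def loss(w, x, t):
    """Real-valued loss |w*x - t|^2 for complex w, x, t."""
    return abs(w * x - t) ** 2

def grad(w, x, t):
    """Return dL/d(Re w) + 1j * dL/d(Im w) for the loss above."""
    z = w * x - t
    return 2 * z * x.conjugate()

x, t = 1 + 2j, 3 - 1j
w = 0j
learning_rate = 0.05
for _ in range(50):
    # Same rule as weights - learning_rate * gradients in the snippet above
    w = w - learning_rate * grad(w, x, t)
```

With these values each step halves the residual w*x - t, so w approaches t / x.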

What's the equivalent of this Keras code in TensorFlow?

The code is as below and runs perfectly:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout
xData = np.array([[5, 3, 7], [1, 2, 6], [8, 7, 6]], dtype=np.float32)
yTrainData = np.array([[1], [0], [1]], dtype=np.float32)
model = Sequential()
model.add(Dense(64, input_dim=3, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
model.fit(xData, yTrainData, epochs=10, batch_size=128, verbose=2)
xTestData = np.array([[2, 8, 1], [3, 1, 9]], dtype=np.float32)
resultAry = model.predict(xTestData)
print("Cal result: %s" % resultAry)
I can't work out the equivalent code in TensorFlow. What I've written so far looks like this:
import tensorflow as tf
import numpy as np
xData = np.array([[5, 3, 7], [1, 2, 6], [8, 7, 6]], dtype=np.float32)
yTrainData = np.array([[1], [0], [1]], dtype=np.float32)
x = tf.placeholder(tf.float32)
yTrain = tf.placeholder(tf.float32)
w = tf.Variable(tf.ones([64]), dtype=tf.float32)
b = tf.Variable(tf.zeros([1]), dtype=tf.float32)
y = tf.nn.relu(w * x + b)
w1 = tf.Variable(tf.ones([3]), dtype=tf.float32)
b1 = tf.Variable(0, dtype=tf.float32)
y1 = tf.reduce_mean(tf.nn.sigmoid(w1 * y + b1))
loss = tf.abs(y1 - tf.reduce_mean(yTrain))
optimizer = tf.train.AdadeltaOptimizer(0.1)
train = optimizer.minimize(loss)
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
for i in range(10):
    for j in range(3):
        result = sess.run([loss, y1, yTrain, x, w, b, train], feed_dict={x: xData[j], yTrain: yTrainData[j]})
        if i % 10 == 0:
            print("i: %d, j: %d, loss: %10.10f, y1: %f, yTrain: %s, x: %s" % (i, j, float(result[0]), float(result[1]), yTrainData[j], xData[j]))
result = sess.run([y1, loss], feed_dict={x: [1, 6, 0], yTrain: 0})
print(result)
But I get the following error when running it:
Traceback (most recent call last):
File "C:\Python36\lib\site-packages\tensorflow\python\client\session.py", line 1327, in _do_call
return fn(*args)
File "C:\Python36\lib\site-packages\tensorflow\python\client\session.py", line 1306, in _run_fn
status, run_metadata)
File "C:\Python36\lib\contextlib.py", line 88, in __exit__
next(self.gen)
File "C:\Python36\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [64] vs. [3]
[[Node: mul = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](Variable/read, _arg_Placeholder_0_0)]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "testidc.py", line 36, in <module>
result = sess.run([loss, y1, yTrain, x, w, b, train], feed_dict={x: xData[j], yTrain: yTrainData[j]})
File "C:\Python36\lib\site-packages\tensorflow\python\client\session.py", line 895, in run
run_metadata_ptr)
File "C:\Python36\lib\site-packages\tensorflow\python\client\session.py", line 1124, in _run
feed_dict_tensor, options, run_metadata)
File "C:\Python36\lib\site-packages\tensorflow\python\client\session.py", line 1321, in _do_run
options, run_metadata)
File "C:\Python36\lib\site-packages\tensorflow\python\client\session.py", line 1340, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [64] vs. [3]
[[Node: mul = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](Variable/read, _arg_Placeholder_0_0)]]
Caused by op 'mul', defined at:
File "testidc.py", line 15, in <module>
y = tf.nn.relu(w * x + b)
File "C:\Python36\lib\site-packages\tensorflow\python\ops\variables.py", line 705, in _run_op
return getattr(ops.Tensor, operator)(a._AsTensor(), *args)
File "C:\Python36\lib\site-packages\tensorflow\python\ops\math_ops.py", line 865, in binary_op_wrapper
return func(x, y, name=name)
File "C:\Python36\lib\site-packages\tensorflow\python\ops\math_ops.py", line 1088, in _mul_dispatch
return gen_math_ops._mul(x, y, name=name)
File "C:\Python36\lib\site-packages\tensorflow\python\ops\gen_math_ops.py", line 1449, in _mul
result = _op_def_lib.apply_op("Mul", x=x, y=y, name=name)
File "C:\Python36\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 767, in apply_op
op_def=op_def)
File "C:\Python36\lib\site-packages\tensorflow\python\framework\ops.py", line 2630, in create_op
original_op=self._default_original_op, op_def=op_def)
File "C:\Python36\lib\site-packages\tensorflow\python\framework\ops.py", line 1204, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InvalidArgumentError (see above for traceback): Incompatible shapes: [64] vs. [3]
[[Node: mul = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](Variable/read, _arg_Placeholder_0_0)]]
The main reason seems to be that the shape of w must match that of x in TensorFlow, whereas in Keras the hidden Dense layer can have more nodes than the input (such as 64 in the example).
I need help writing the equivalent TensorFlow code. Thanks.
This is an example that uses the tf.estimator.Estimator framework:
import tensorflow as tf
import numpy as np
# The model
def model(features, training=False):
    dense = tf.layers.dense(inputs=features['x'], units=64, activation=tf.nn.relu)
    # Dropout is only applied when training=True
    dropout = tf.layers.dropout(dense, rate=0.2, training=training)
    # A single raw logit per sample; the sigmoid is applied in the predictions and inside the loss
    logits = tf.layers.dense(inputs=dropout, units=1, activation=None)
    return logits
# Stuff needed to use the tf.estimator.Estimator framework
def model_fn(features, labels, mode):
    logits = model(features, training=(mode == tf.estimator.ModeKeys.TRAIN))
    probabilities = tf.nn.sigmoid(logits)
    predictions = {
        'classes': tf.cast(probabilities > 0.5, tf.int64),
        'probabilities': probabilities
    }
    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(mode=mode, predictions=predictions)
    # Binary cross-entropy on the logits, matching loss='binary_crossentropy' in the Keras model
    loss = tf.losses.sigmoid_cross_entropy(multi_class_labels=labels, logits=logits)
    # Configure the training op
    if mode == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf.train.RMSPropOptimizer(learning_rate=1e-4)
        train_op = optimizer.minimize(loss, tf.train.get_or_create_global_step())
    else:
        train_op = None
    accuracy = tf.metrics.accuracy(labels, predictions['classes'])
    metrics = {'accuracy': accuracy}
    # Create a tensor named train_accuracy for logging purposes
    tf.identity(accuracy[1], name='train_accuracy')
    tf.summary.scalar('train_accuracy', accuracy[1])
    return tf.estimator.EstimatorSpec(
        mode=mode,
        predictions=predictions,
        loss=loss,
        train_op=train_op,
        eval_metric_ops=metrics)
# Setting up input for the model
def input_fn(mode, batch_size):
    # function that processes your input and returns two tensors "samples" and "labels"
    # that the estimator will use to fetch input batches.
    # See https://www.tensorflow.org/get_started/input_fn for how to write this function.
    return samples, labels
# Using the model
def main():
    # Create the Estimator
    classifier = tf.estimator.Estimator(
        model_fn=model_fn, model_dir='some_dir')
    # Train the model
    # NOTE: I use this to make it compatible with your example, but you should
    # definitely set up your own input_fn above
    train_input_fn = tf.estimator.inputs.numpy_input_fn(
        x={"x": np.array([[5, 3, 7], [1, 2, 6], [8, 7, 6]], dtype=np.float32)},
        y=np.array([[1], [0], [1]]),
        num_epochs=10,
        batch_size=128,
        shuffle=False)
    classifier.train(
        input_fn=train_input_fn,
        steps=20000, # change as needed
    )
    # Predict on new data
    predict_input_fn = tf.estimator.inputs.numpy_input_fn(
        x={"x": np.array([[5, 3, 7], [1, 2, 6], [8, 7, 6]], dtype=np.float32)},
        num_epochs=1,
        batch_size=1,
        shuffle=False)
    predictions_iterator = classifier.predict(
        input_fn=predict_input_fn)
    print('Predictions results:')
    for pred in predictions_iterator:
        print(pred)
There is quite a bit going on here, so I'll try to explain the blocks one by one.
The model
The model is defined as a composition of tf.layers in a separate model function. This is done to keep the actual model_fn (which is required by the Estimator framework) independent of the model architecture.
The function takes a features parameter, which is the output of a call to input_fn (see below). In this example, since we're using tf.estimator.inputs.numpy_input_fn, features is a dictionary with item x:input_tensor. We use the input tensor as input for our model graph.
model_fn
This function is required by the framework and is used to generate a specification for your Estimator that depends on the mode the estimator is being used in. Typically, an estimator used for prediction will have fewer operations than one used for training (you don't have the loss, optimizer, etc.). This function takes care of adding everything your model graph needs for the three possible modes of operation (prediction, evaluation, training).
Breaking it down to logical pieces, we have:
Prediction: we only need the model graph, the predictions and the corresponding predicted labels (we could skip the labels, but having it here is handy).
Evaluation: we need everything for prediction plus: a loss function, some metric to evaluate on and optionally some summaries to visualize the metrics in Tensorboard.
Training: we need everything for evaluation plus: a training operation from an optimizer (in your sample, RMSProp)
input_fn
This is where we provide the input to our estimator.
Have a look at Building Input Functions with tf.estimator for a guide on what your custom input_fn should look like. For the example, we'll use the numpy_input_fn function from the framework.
Note that usually one input_fn handles all operation modes according to a mode parameter. Since we're using numpy_input_fn, we need two different instances of it for training and prediction to provide the data as needed.
main
Here we actually train and use the estimator.
Firstly, we get an Estimator instance with the model_fn we specified, then we call train() and wait for the training to be over.
Once that is done, calling predict() returns an iterable that you can use to get the prediction results for all the samples in the dataset you're predicting.
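On the shape error in the question: a Dense layer does not multiply its weights element-wise with the input. It computes a matrix product between a [batch, inputs] input and an [inputs, units] weight matrix and then adds a [units] bias, which is why 64 hidden units work with 3 input features. The NumPy sketch below walks through the shapes of the Keras model from the question; the random initial weights are placeholders for illustration, not trained values.

```python
import numpy as np

rng = np.random.RandomState(0)
x = np.array([[5, 3, 7], [1, 2, 6], [8, 7, 6]], dtype=np.float32)  # [batch=3, inputs=3]

w = rng.normal(scale=0.1, size=(3, 64)).astype(np.float32)   # hidden weights [inputs, units]
b = np.zeros(64, dtype=np.float32)                           # hidden bias [units]
w1 = rng.normal(scale=0.1, size=(64, 1)).astype(np.float32)  # output weights [units, 1]
b1 = np.zeros(1, dtype=np.float32)

hidden = np.maximum(x @ w + b, 0.0)                  # ReLU, shape [3, 64]
output = 1.0 / (1.0 + np.exp(-(hidden @ w1 + b1)))   # sigmoid, shape [3, 1]
```

By contrast, an element-wise product of a length-64 vector with a length-3 vector cannot be broadcast, which matches the "Incompatible shapes: [64] vs. [3]" message.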
This is a couple of months old but it's worth noting that there is absolutely no reason to not use keras with tensorflow. It's even part of the tensorflow library now!
So if you want full control of your tensors but still want to use keras' layers, you can easily achieve that by using keras as-is:
x = tf.placeholder(tf.float32, [None, 1024])
y = keras.layers.Dense(512, activation='relu')(x)
For more on that, keras' creator made a pretty cool post about it.

Tensorflow does NOT utilize the memory from two GPUs in Windows 10

Tensorflow Version 1.3.0
OS: Windows 10
GPUs: Nvidia Quadro M4000 * 2 with 8G GPU memory for each
GPU modes: one for WDDM, one for TCC
I tested the official code at https://github.com/tensorflow/models/blob/master/official/resnet/imagenet_main.py
I only added the GPU constraints in the main function:
def main(unused_argv):
    os.environ['TF_ENABLE_WINOGRAD_NONFUSED'] = '1'
    # With this line, setting visible_device_list to "0" or to "0, 1" supports only the same maximum batch_size
    config = tf.ConfigProto(gpu_options=tf.GPUOptions(visible_device_list='0, 1'))
    resnet_classifier = tf.estimator.Estimator(
        model_fn=imagenet_model_fn, model_dir=FLAGS.model_dir,
        config=tf.contrib.learn.RunConfig(session_config=config))
    for cycle in range(FLAGS.train_steps // FLAGS.steps_per_eval):
        tensors_to_log = {
            'learning_rate': 'learning_rate',
            'cross_entropy': 'cross_entropy',
            'train_accuracy': 'train_accuracy'
        }
        logging_hook = tf.train.LoggingTensorHook(
            tensors=tensors_to_log, every_n_iter=100)
        print('Starting a training cycle.')
        resnet_classifier.train(
            input_fn=lambda: input_fn(tf.estimator.ModeKeys.TRAIN),
            steps=FLAGS.first_cycle_steps or FLAGS.steps_per_eval,
            hooks=[logging_hook])
        FLAGS.first_cycle_steps = None
        print('Starting to evaluate.')
        eval_results = resnet_classifier.evaluate(
            input_fn=lambda: input_fn(tf.estimator.ModeKeys.EVAL))
        print(eval_results)
In the training process, whether I set the visible device list to "0, 1" or to "0" only, both run successfully with batch_size=48, but BOTH fail with batch_size=49! This indicates that the second GPU's memory is not utilized, since the batch size cannot be larger when using two GPUs. I have used nvidia-smi to confirm that one GPU or two GPUs, respectively, are in use in the above experiments.
My questions are:
Is there any way that I can use bigger batch_size when using two GPUs?
If the answer for Q1 is No in Windows, is there any way to do it in Linux? I am not familiar with Linux. In Linux, can I set all GPUs to TCC mode? Will the batch size be bigger when two GPUs are both in TCC mode?
Thank you.
-------------Update------------
I have tried to distribute the data batch across two GPUs, and now there is a NaN loss error. What could be the cause? It ran well before (using one GPU only). But now, even if I set _DEVICE_LIST to one GPU only, it still produces the NaN loss error.
My modified code is:
def imagenet_model_fn(features, labels, mode):
    tf.summary.image('images', features, max_outputs=6)
    with tf.device('/cpu:0'):
        split_batch = tf.split(features, len(_DEVICE_LIST))
        split_labels = tf.split(labels, len(_DEVICE_LIST))
    all_predictions = {
        'classes': [],
        'probabilities': []
    }
    all_cross_entropy = []
    all_reg_loss = []
    with tf.variable_scope(tf.get_variable_scope()):
        for dev_idx, (device, device_features, device_labels) in enumerate(zip(
                _DEVICE_LIST, split_batch, split_labels)):
            with tf.device(device):
                with tf.name_scope('device_%d' % dev_idx):
                    logits = network(inputs=device_features,
                                     is_training=(mode == tf.estimator.ModeKeys.TRAIN))
                    tf.get_variable_scope().reuse_variables()
                    all_predictions['classes'].append(tf.argmax(logits, axis=1))
                    all_predictions['probabilities'].append(tf.nn.softmax(logits))
                    if mode == tf.estimator.ModeKeys.TRAIN:
                        # Calculate loss, which includes softmax cross entropy and L2 regularization.
                        cross_entropy = tf.losses.softmax_cross_entropy(
                            logits=logits, onehot_labels=device_labels)
                        reg_loss = FLAGS.weight_decay * tf.add_n(
                            [tf.nn.l2_loss(v) for v in tf.trainable_variables()])
                        all_cross_entropy.append(cross_entropy)
                        all_reg_loss.append(reg_loss)
    all_predictions['classes'] = tf.reshape(all_predictions['classes'], [-1])
    all_predictions['probabilities'] = tf.reshape(
        all_predictions['probabilities'], [-1])
    total_cross_entropy = tf.add_n(all_cross_entropy)
    total_reg_loss = tf.add_n(all_reg_loss)
    total_loss = total_cross_entropy + total_reg_loss
    tf.identity(total_cross_entropy, name='cross_entropy')
    tf.summary.scalar('cross_entropy', total_cross_entropy)
    tf.summary.scalar('reg_loss', total_reg_loss)
    tf.summary.scalar('total_loss', total_loss)
    if mode == tf.estimator.ModeKeys.TRAIN:
        global_step = tf.train.get_or_create_global_step()
        boundaries = [
            int(batches_per_epoch * epoch) for epoch in [30, 60, 120, 150]]
        values = [
            _INITIAL_LEARNING_RATE * decay for decay in [1, 0.1, 0.01, 1e-3, 1e-4]]
        learning_rate = tf.train.piecewise_constant(
            tf.cast(global_step, tf.int32), boundaries, values)
        tf.identity(learning_rate, name='learning_rate')
        tf.summary.scalar('learning_rate', learning_rate)
        optimizer = tf.train.MomentumOptimizer(
            learning_rate=learning_rate,
            momentum=_MOMENTUM)
        # Batch norm requires update_ops to be added as a train_op dependency.
        update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
        with tf.control_dependencies(update_ops):
            train_op = optimizer.minimize(total_loss, global_step)
    else:
        train_op = None
    return tf.estimator.EstimatorSpec(
        mode=mode,
        predictions=all_predictions,
        loss=total_loss,
        train_op=train_op)
The error message is:
INFO:tensorflow:Saving checkpoints for 1 into F:\projects\DeepLearning\TensorFlow\Models\ImageNet\resnet_101_imagenet_augmented\temp\model.ckpt.
INFO:tensorflow:learning_rate = 0.003125, cross_entropy = 14.394
INFO:tensorflow:loss = 30.0782, step = 1
ERROR:tensorflow:Model diverged with loss = NaN.
Traceback (most recent call last):
File "imagenet_main.py", line 321, in <module>
tf.app.run()
File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "imagenet_main.py", line 310, in main
hooks=[logging_hook])
File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\estimator\estimator.py", line 241, in train
loss = self._train_model(input_fn=input_fn, hooks=hooks)
File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\estimator\estimator.py", line 686, in _train_model
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\training\monitored_session.py", line 518, in run
run_metadata=run_metadata)
File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\training\monitored_session.py", line 862, in run
run_metadata=run_metadata)
File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\training\monitored_session.py", line 818, in run
return self._sess.run(*args, **kwargs)
File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\training\monitored_session.py", line 980, in run
run_metadata=run_metadata))
File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\training\basic_session_run_hooks.py", line 551, in after_run
raise NanLossDuringTrainingError
tensorflow.python.training.basic_session_run_hooks.NanLossDuringTrainingError: NaN loss during training.
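On the original batch-size question: as far as I know, listing two GPUs in visible_device_list only makes them visible. TensorFlow 1.x still places the ops on the first GPU unless the graph assigns work to each device with tf.device, which would explain why the maximum batch size did not change. The usual way to use both is data parallelism: split each batch into one shard per device, compute the gradients per shard, and average them. The NumPy sketch below illustrates only that averaging idea with a toy linear model; the loss, shapes, and names are assumptions for illustration and are unrelated to the ResNet code.

```python
import numpy as np

def mse_grad(w, x, y):
    """Gradient of mean((x @ w - y)**2) with respect to w."""
    residual = x @ w - y
    return 2.0 * x.T @ residual / len(x)

rng = np.random.RandomState(0)
x = rng.normal(size=(8, 3))
y = rng.normal(size=8)
w = rng.normal(size=3)

# Split the batch into two equal shards, one per (hypothetical) device.
x_shards = np.split(x, 2)
y_shards = np.split(y, 2)
shard_grads = [mse_grad(w, xs, ys) for xs, ys in zip(x_shards, y_shards)]

# Averaging equal-sized shard gradients reproduces the full-batch gradient.
avg_grad = np.mean(shard_grads, axis=0)
full_grad = mse_grad(w, x, y)
```

Note that the averaging matters: summing per-device losses instead scales the effective loss with the number of devices.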

Running distributed Tensorflow with InvalidArgumentError: You must feed a value for placeholder tensor 'Placeholder' with dtype float

I have implemented a variational autoencoder with TensorFlow on a single machine. Now I am trying to run it on my cluster with the distributed mechanism provided by TensorFlow, but the following problem has had me stuck for several days.
Traceback (most recent call last):
File "/home/yama/mfs/ZhuSuan/examples/vae.py", line 265, in <module>
print('>> Test log likelihood = {}'.format(np.mean(test_lls)))
File "/usr/lib/python2.7/contextlib.py", line 35, in __exit__
self.gen.throw(type, value, traceback)
File "/mfs/yama/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 942, in managed_session
self.stop(close_summary_writer=close_summary_writer)
File "/mfs/yama/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 768, in stop
stop_grace_period_secs=self._stop_grace_secs)
File "/mfs/yama/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 322, in join
six.reraise(*self._exc_info_to_raise)
File "/mfs/yama/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 267, in stop_on_exception
yield
File "/mfs/yama/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 411, in run
self.run_loop()
File "/mfs/yama/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 972, in run_loop
self._sv.global_step])
File "/mfs/yama/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 372, in run
run_metadata_ptr)
File "/mfs/yama/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 636, in _run
feed_dict_string, options, run_metadata)
File "/mfs/yama/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 708, in _do_run
target_list, options, run_metadata)
File "/mfs/yama/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 728, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.InvalidArgumentError: You must feed a value for placeholder tensor 'Placeholder' with dtype float
[[Node: Placeholder = Placeholder[dtype=DT_FLOAT, shape=[], _device="/job:worker/replica:0/task:0/gpu:0"]()]]
[[Node: model_1/fully_connected_10/Relu_G88 = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:0/cpu:0", send_device="/job:worker/replica:0/task:0/gpu:0", send_device_incarnation=3964479821165574552, tensor_name="edge_694_model_1/fully_connected_10/Relu", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:0/cpu:0"]()]]
Caused by op u'Placeholder', defined at:
File "/home/yama/mfs/ZhuSuan/examples/vae.py", line 201, in <module>
x = tf.placeholder(tf.float32, shape=(None, x_train.shape[1]))
File "/mfs/yama/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/array_ops.py", line 895, in placeholder
name=name)
File "/mfs/yama/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 1238, in _placeholder
name=name)
File "/mfs/yama/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/op_def_library.py", line 704, in apply_op
op_def=op_def)
File "/mfs/yama/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2260, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/mfs/yama/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1230, in __init__
self._traceback = _extract_stack()
Here is my code. I have pasted only the main function for simplicity:
if __name__ == "__main__":
    tf.set_random_seed(1234)
    # Load MNIST
    data_path = os.path.join(os.path.dirname(os.path.abspath(__file__)),
                             'data', 'mnist.pkl.gz')
    x_train, t_train, x_valid, t_valid, x_test, t_test = \
        dataset.load_mnist_realval(data_path)
    x_train = np.vstack([x_train, x_valid])
    np.random.seed(1234)
    x_test = np.random.binomial(1, x_test, size=x_test.shape).astype('float32')
    # Define hyper-parameters
    n_z = 40
    # Define training/evaluation parameters
    lb_samples = 1
    ll_samples = 5000
    epoches = 10
    batch_size = 100
    test_batch_size = 100
    iters = x_train.shape[0] // batch_size
    test_iters = x_test.shape[0] // test_batch_size
    test_freq = 10
    ps_hosts = FLAGS.ps_hosts.split(",")
    worker_hosts = FLAGS.worker_hosts.split(",")
    # Create a cluster from the parameter server and worker hosts.
    clusterSpec = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})
    print("Create and start a server for the local task.")
    # Create and start a server for the local task.
    server = tf.train.Server(clusterSpec,
                             job_name=FLAGS.job_name,
                             task_index=FLAGS.task_index)
    print("Start ps and worker server")
    if FLAGS.job_name == "ps":
        server.join()
    elif FLAGS.job_name == "worker":
        #set distributed device
        with tf.device(tf.train.replica_device_setter(
                worker_device="/job:worker/task:%d" % FLAGS.task_index,
                cluster=clusterSpec)):
            print("Build the training computation graph")
            # Build the training computation graph
            x = tf.placeholder(tf.float32, shape=(None, x_train.shape[1]))
            optimizer = tf.train.AdamOptimizer(learning_rate=0.001, epsilon=1e-4)
            with tf.variable_scope("model") as scope:
                with pt.defaults_scope(phase=pt.Phase.train):
                    train_model = M1(n_z, x_train.shape[1])
                    train_vz_mean, train_vz_logstd = q_net(x, n_z)
                    train_variational = ReparameterizedNormal(
                        train_vz_mean, train_vz_logstd)
                    grads, lower_bound = advi(
                        train_model, x, train_variational, lb_samples, optimizer)
                    infer = optimizer.apply_gradients(grads)
            print("Build the evaluation computation graph")
            # Build the evaluation computation graph
            with tf.variable_scope("model", reuse=True) as scope:
                with pt.defaults_scope(phase=pt.Phase.test):
                    eval_model = M1(n_z, x_train.shape[1])
                    eval_vz_mean, eval_vz_logstd = q_net(x, n_z)
                    eval_variational = ReparameterizedNormal(
                        eval_vz_mean, eval_vz_logstd)
                    eval_lower_bound = is_loglikelihood(
                        eval_model, x, eval_variational, lb_samples)
                    eval_log_likelihood = is_loglikelihood(
                        eval_model, x, eval_variational, ll_samples)
            global_step = tf.Variable(0)
            saver = tf.train.Saver()
            summary_op = tf.merge_all_summaries()
            init_op = tf.initialize_all_variables()
        # Create a "supervisor", which oversees the training process.
        sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0),
                                 logdir=LogDir,
                                 init_op=init_op,
                                 summary_op=summary_op,
                                 saver=saver,
                                 global_step=global_step,
                                 save_model_secs=600)
        # Run the inference
        with sv.managed_session(server.target) as sess:
            epoch = 0
            while not sv.should_stop() and epoch < epoches:
                #for epoch in range(1, epoches + 1):
                np.random.shuffle(x_train)
                lbs = []
                for t in range(iters):
                    x_batch = x_train[t * batch_size:(t + 1) * batch_size]
                    x_batch = np.random.binomial(n=1, p=x_batch, size=x_batch.shape).astype('float32')
                    _, lb = sess.run([infer, lower_bound], feed_dict={x: x_batch})
                    lbs.append(lb)
                if epoch % test_freq == 0:
                    test_lbs = []
                    test_lls = []
                    for t in range(test_iters):
                        test_x_batch = x_test[
                            t * test_batch_size: (t + 1) * test_batch_size]
                        test_lb, test_ll = sess.run(
                            [eval_lower_bound, eval_log_likelihood],
                            feed_dict={x: test_x_batch}
                        )
                        test_lbs.append(test_lb)
                        test_lls.append(test_ll)
                    print('>> Test lower bound = {}'.format(np.mean(test_lbs)))
                    print('>> Test log likelihood = {}'.format(np.mean(test_lls)))
                epoch += 1  # advance the epoch counter so the while loop terminates
        sv.stop()
I have been trying to correct my code for several days, but all my efforts have failed. Looking forward to your help!
The most likely cause of this exception is that one of the operations that the tf.train.Supervisor runs in the background depends on the tf.placeholder() tensor x, but doesn't have enough information to feed a value for it.
The most likely culprit is summary_op = tf.merge_all_summaries(), because library code often summarizes values that depend on the training data. To prevent the supervisor from collecting summaries in the background, pass summary_op=None to the tf.train.Supervisor constructor:
# Create a "supervisor", which oversees the training process.
sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0),
                         logdir=LogDir,
                         init_op=init_op,
                         summary_op=None,
                         saver=saver,
                         global_step=global_step,
                         save_model_secs=600)
After doing this, you will need to make alternative arrangements to collect summaries. The easiest way to do this is to pass summary_op to sess.run() periodically, then pass the result to sv.summary_computed().
Came across a similar thing. The chief was going down with the aforementioned error message. However, since I was using the MonitoredTrainingSession rather than a self-made Supervisor, I was able to solve the problem by disabling the default summary. To disable, you have to provide
save_summaries_secs=None,
save_summaries_steps=None,
to the MonitoredTrainingSession call. Afterwards, everything went smoothly!
Code on Github
I had the same exact problem. Following mrry's suggestion I was able to work this out by:
Disabling summary logging in the supervisor by setting summary_op=None (as mrry suggested)
Creating my own summary_op and passing it to sess.run() along with the rest of the ops to be evaluated. Hold on to the resulting summary; let's say it's called 'my_summary'.
Creating my own summary writer and calling it with 'my_summary', e.g.: summary_writer.add_summary(my_summary, epoch_count)
To clarify, I did not use mrry's suggestion to do
sess.run(summary_op) and sv.summary_computed(), but instead ran the summary_op along with the other operations, and then wrote out the summary myself. You might also want to condition the summary writing on being a chief.
So basically, you need to bypass the Supervisor's summary writing services completely. This seems like a surprising limitation/bug of Supervisor, since it isn't exactly uncommon to want to log things that depend on the input (which lives in a placeholder). For example, in my network (an autoencoder) the cost depends on the input.

TensorFlow shape error in feed_dict

I'm trying to adapt an existing logreg example to my data and am getting the following error:
Epoch: 0001 cost=
Traceback (most recent call last):
File "tflin.py", line 64, in <module>
print "Epoch:", '%04d' % (epoch+1), "cost=", "{:.9f}".format(sess.run(cost, feed_dict={X: train_X, Y:train_Y})), \
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 315, in run
return self._run(None, fetches, feed_dict)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 506, in _run
% (np_val.shape, subfeed_t.name, str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape (60000, 6) for Tensor u'Placeholder:0', which has shape '(6,)'
Source code can be found here: https://github.com/ilautar/tensorflow-test/blob/master/tflin.py
I'm sure it is obvious, any pointers?
Thank you,
Igor
The error occurs because you are trying to feed a 60000 x 6 matrix into a tf.placeholder() that is defined to be a vector of length 6. This happens when you try to feed the whole train_X matrix (as opposed to feeding a single row, which succeeds).
The best way to make this work is to do the following:
Define your placeholders (and model) in terms of batch inputs, which can have varying shape:
# tf Graph Input
X = tf.placeholder(tf.float32, [None, n_input])
Y = tf.placeholder(tf.float32, [None])
When feeding in a single example, extend it to be a 1 x 6 matrix using numpy.newaxis:
# Fit all training data
for epoch in range(training_epochs):
    for (x, y) in zip(train_X, train_Y):
        sess.run(optimizer, feed_dict={X: x[numpy.newaxis, ...],
                                       Y: y[numpy.newaxis, ...]})
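The shape bookkeeping can be checked in plain NumPy without building a graph. In the sketch below, the zero-filled train_X only mirrors the shape from the error message, and fits is a hypothetical helper that mimics how a placeholder shape with None entries accepts any size along that dimension; it is not a TensorFlow function.

```python
import numpy as np

train_X = np.zeros((60000, 6), dtype=np.float32)  # same shape as in the error message

row = train_X[0]                     # a single example, shape (6,)
batch_of_one = row[np.newaxis, ...]  # the same example as a 1 x 6 batch

def fits(shape, placeholder_shape):
    """True if a concrete shape is compatible with a placeholder shape (None = any size)."""
    return len(shape) == len(placeholder_shape) and all(
        p is None or p == s for s, p in zip(shape, placeholder_shape))

print(fits(train_X.shape, (6,)))            # the original placeholder rejects the full matrix
print(fits(train_X.shape, (None, 6)))       # the batch placeholder accepts it
print(fits(batch_of_one.shape, (None, 6)))  # and also accepts a single expanded row
```

With the [None, n_input] placeholder, both the per-example training loop and the full-matrix cost evaluation use the same graph.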