Keras Multi GPU example gives ResourceExhaustedError - tensorflow

I am trying to use multiple GPUs with Keras. When I run the example program given in the comments of training_utils.py, I get a ResourceExhaustedError. nvidia-smi shows that barely one of the four GPUs is working. Using a single GPU works fine for other programs.
TensorFlow 1.3.0
Keras 2.0.8
Ubuntu 16.04
CUDA/cuDNN 8.0/6.0
Question: Does anyone have any idea what's going on here?
Console output:
(...)
2017-10-26 14:39:02.086838: W tensorflow/core/common_runtime/bfc_allocator.cc:277] ***************************************************************************************************x
2017-10-26 14:39:02.086857: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[128,55,55,256]
Traceback (most recent call last):
File "test.py", line 27, in <module>
parallel_model.fit(x, y, epochs=20, batch_size=256)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/training.py", line 1631, in fit
validation_steps=validation_steps)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/training.py", line 1213, in _fit_loop
outs = f(ins_batch)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 2331, in __call__
**self.session_kwargs)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 895, in run
run_metadata_ptr)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1124, in _run
feed_dict_tensor, options, run_metadata)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
options, run_metadata)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[128,55,55,256]
[[Node: replica_1/xception/block3_sepconv2/separable_conv2d = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="VALID", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:1"](replica_1/xception/block3_sepconv2/separable_conv2d/depthwise, block3_sepconv2/pointwise_kernel/read/_2103)]]
[[Node: training/RMSprop/gradients/replica_0/xception/block10_sepconv2/separable_conv2d_grad/Conv2DBackpropFilter/_4511 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_25380_training/RMSprop/gradients/replica_0/xception/block10_sepconv2/separable_conv2d_grad/Conv2DBackpropFilter", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Caused by op u'replica_1/xception/block3_sepconv2/separable_conv2d', defined at:
File "test.py", line 19, in <module>
parallel_model = multi_gpu_model(model, gpus=2)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/utils/training_utils.py", line 143, in multi_gpu_model
outputs = model(inputs)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/topology.py", line 603, in __call__
output = self.call(inputs, **kwargs)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/topology.py", line 2061, in call
output_tensors, _, _ = self.run_internal_graph(inputs, masks)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/topology.py", line 2212, in run_internal_graph
output_tensors = _to_list(layer.call(computed_tensor, **kwargs))
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/layers/convolutional.py", line 1221, in call
dilation_rate=self.dilation_rate)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 3279, in separable_conv2d
data_format=tf_data_format)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/nn_impl.py", line 497, in separable_conv2d
name=name)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 397, in conv2d
data_format=data_format, name=name)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1204, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[128,55,55,256]
[[Node: replica_1/xception/block3_sepconv2/separable_conv2d = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="VALID", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:1"](replica_1/xception/block3_sepconv2/separable_conv2d/depthwise, block3_sepconv2/pointwise_kernel/read/_2103)]]
[[Node: training/RMSprop/gradients/replica_0/xception/block10_sepconv2/separable_conv2d_grad/Conv2DBackpropFilter/_4511 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_25380_training/RMSprop/gradients/replica_0/xception/block10_sepconv2/separable_conv2d_grad/Conv2DBackpropFilter", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
EDIT (Added example code):
import numpy as np
import tensorflow as tf
from keras.applications import Xception
from keras.utils import multi_gpu_model
num_samples = 1000
height = 224
width = 224
num_classes = 100
# Build the base model on the CPU so its weights are hosted in CPU memory.
with tf.device('/cpu:0'):
    model = Xception(weights=None,
                     input_shape=(height, width, 3),
                     classes=num_classes)
# Replicate the model on 4 GPUs.
parallel_model = multi_gpu_model(model, gpus=4)
parallel_model.compile(loss='categorical_crossentropy',
                       optimizer='rmsprop')
# Random dummy data.
x = np.random.random((num_samples, height, width, 3))
y = np.random.random((num_samples, num_classes))
parallel_model.fit(x, y, epochs=20, batch_size=128)

When you hit an OOM/ResourceExhaustedError on the GPU, I believe reducing the batch size is the right thing to try first.
The batch size that fits depends on how much memory each GPU has, so different GPUs may need different batch sizes.
I recently faced a similar problem and had to experiment a lot to get past it.
Here is the link to the question (some tricks are included as well).
Be aware, however, that training may get slower as you reduce the batch size.
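To get a feel for the numbers, here is a rough back-of-the-envelope sketch in plain Python. It assumes the tensor is float32 (DT_FLOAT, 4 bytes per element, as in the error message) and that multi_gpu_model splits each batch evenly across the GPUs, which matches the traceback: batch_size=256 with gpus=2 gives the 128 in shape[128,55,55,256]. The helper names tensor_bytes and per_replica_batch are made up for this illustration.

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Size in bytes of a dense tensor; 4 bytes per element assumes float32."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


def per_replica_batch(batch_size, gpus):
    """Samples each GPU receives when a batch is split evenly across replicas."""
    return batch_size // gpus


# The tensor that failed to allocate in the traceback above.
failing = tensor_bytes((128, 55, 55, 256))
print("failing tensor: %.3f MiB" % (failing / 2.0 ** 20))

# Effect of a smaller global batch size on that same activation (2 GPUs).
for batch_size in (256, 128, 64, 32):
    sub_batch = per_replica_batch(batch_size, 2)
    size = tensor_bytes((sub_batch, 55, 55, 256))
    print("batch_size=%d -> %d per GPU, %.3f MiB" % (batch_size, sub_batch, size / 2.0 ** 20))
```

This is only one activation of one layer. The network keeps many such tensors alive for backpropagation, so the total footprint is far larger, but it scales roughly linearly with the per-GPU batch size.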

Related

Custom metric: Using scikit learn's AucRoc Calculator with tf.keras

I'm training a 14-class multilabel classifier using tf.keras and horovod. AucRoc is used as the metric to evaluate the performance of the classifier. I want to be able to use scikit-learn's AucRoc calculator, as mentioned here: How to compute Receiving Operating Characteristic (ROC) and AUC in keras?. If I feed the tensors as they are to the following function:
def sci_auc_roc(y_true, y_pred):
    return tf.py_func(roc_auc_score(y_true, y_pred), tf.double)
I get an error that looks like this:
/mnt/lustrefs/rakvee/miniconda3/envs/docker_pip2/lib/python3.6/site-packages/keras_applications/resnet50.py:265: UserWarning: The output shape of `ResNet50(include_top=False)` has been changed since Keras 2.2.0.
warnings.warn('The output shape of `ResNet50(include_top=False)` '
Traceback (most recent call last):
File "official_resnet_tf_1.12.0_auc.py", line 531, in <module>
main()
File "official_resnet_tf_1.12.0_auc.py", line 420, in main
model = chexnet_model(FLAGS)
File "official_resnet_tf_1.12.0_auc.py", line 375, in chexnet_model
metrics=[tf_auc_roc,sci_auc_roc])
File "/mnt/lustrefs/rakvee/miniconda3/envs/docker_pip2/lib/python3.6/site-packages/tensorflow/python/training/checkpointable/base.py", line 474, in _method_wrapper
method(self, *args, **kwargs)
File "/mnt/lustrefs/rakvee/miniconda3/envs/docker_pip2/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 648, in compile
sample_weights=self.sample_weights)
File "/mnt/lustrefs/rakvee/miniconda3/envs/docker_pip2/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 313, in _handle_metrics
output, output_mask))
File "/mnt/lustrefs/rakvee/miniconda3/envs/docker_pip2/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 270, in _handle_per_output_metrics
y_true, y_pred, weights=weights, mask=mask)
File "/mnt/lustrefs/rakvee/miniconda3/envs/docker_pip2/lib/python3.6/site-packages/tensorflow/python/keras/engine/training_utils.py", line 598, in weighted
score_array = fn(y_true, y_pred)
File "official_resnet_tf_1.12.0_auc.py", line 327, in sci_auc_roc
return tf.py_func(roc_auc_score(y_true, y_pred), tf.double)
File "/mnt/lustrefs/rakvee/miniconda3/envs/docker_pip2/lib/python3.6/site-packages/sklearn/metrics/ranking.py", line 349, in roc_auc_score
y_type = type_of_target(y_true)
File "/mnt/lustrefs/rakvee/miniconda3/envs/docker_pip2/lib/python3.6/site-packages/sklearn/utils/multiclass.py", line 243, in type_of_target
'got %r' % y)
ValueError: Expected array-like (array or non-string sequence), got <tf.Tensor 'dense_target:0' shape=(?, ?) dtype=float32>
I then tried converting the tf tensors into numpy arrays and feeding those to the roc_auc_score method, like so:
def sci_auc_roc(y_true, y_pred):
    with tf.Session() as sess:
        y_true, y_pred = sess.run([y_true, y_pred])
    return tf.py_func(roc_auc_score(y_true, y_pred), tf.double)
I get the following error:
warnings.warn('The output shape of `ResNet50(include_top=False)` '
Traceback (most recent call last):
File "/mnt/lustrefs/rakvee/miniconda3/envs/docker_pip2/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
return fn(*args)
File "/mnt/lustrefs/rakvee/miniconda3/envs/docker_pip2/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/mnt/lustrefs/rakvee/miniconda3/envs/docker_pip2/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: You must feed a value for placeholder tensor 'input_1' with dtype float and shape [?,256,256,3]
[[{{node input_1}} = Placeholder[dtype=DT_FLOAT, shape=[?,256,256,3], _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
[[{{node dense_target/_5}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2237_dense_target", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "official_resnet_tf_1.12.0_auc.py", line 531, in <module>
main()
File "official_resnet_tf_1.12.0_auc.py", line 420, in main
model = chexnet_model(FLAGS)
File "official_resnet_tf_1.12.0_auc.py", line 375, in chexnet_model
metrics=[tf_auc_roc,sci_auc_roc])
File "/mnt/lustrefs/rakvee/miniconda3/envs/docker_pip2/lib/python3.6/site-packages/tensorflow/python/training/checkpointable/base.py", line 474, in _method_wrapper
method(self, *args, **kwargs)
File "/mnt/lustrefs/rakvee/miniconda3/envs/docker_pip2/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 648, in compile
sample_weights=self.sample_weights)
File "/mnt/lustrefs/rakvee/miniconda3/envs/docker_pip2/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 313, in _handle_metrics
output, output_mask))
File "/mnt/lustrefs/rakvee/miniconda3/envs/docker_pip2/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 270, in _handle_per_output_metrics
y_true, y_pred, weights=weights, mask=mask)
File "/mnt/lustrefs/rakvee/miniconda3/envs/docker_pip2/lib/python3.6/site-packages/tensorflow/python/keras/engine/training_utils.py", line 598, in weighted
score_array = fn(y_true, y_pred)
File "official_resnet_tf_1.12.0_auc.py", line 324, in sci_auc_roc
y_true, y_pred = sess.run([y_true, y_pred])
File "/mnt/lustrefs/rakvee/miniconda3/envs/docker_pip2/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/mnt/lustrefs/rakvee/miniconda3/envs/docker_pip2/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/mnt/lustrefs/rakvee/miniconda3/envs/docker_pip2/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/mnt/lustrefs/rakvee/miniconda3/envs/docker_pip2/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: You must feed a value for placeholder tensor 'input_1' with dtype float and shape [?,256,256,3]
[[node input_1 (defined at /mnt/lustrefs/rakvee/miniconda3/envs/docker_pip2/lib/python3.6/site-packages/keras_applications/resnet50.py:214) = Placeholder[dtype=DT_FLOAT, shape=[?,256,256,3], _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
[[{{node dense_target/_5}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2237_dense_target", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Caused by op 'input_1', defined at:
File "official_resnet_tf_1.12.0_auc.py", line 531, in <module>
main()
File "official_resnet_tf_1.12.0_auc.py", line 420, in main
model = chexnet_model(FLAGS)
File "official_resnet_tf_1.12.0_auc.py", line 339, in chexnet_model
input_shape=(FLAGS.image_size, FLAGS.image_size, 3))
File "/mnt/lustrefs/rakvee/miniconda3/envs/docker_pip2/lib/python3.6/site-packages/tensorflow/python/keras/applications/__init__.py", line 70, in wrapper
return base_fun(*args, **kwargs)
File "/mnt/lustrefs/rakvee/miniconda3/envs/docker_pip2/lib/python3.6/site-packages/tensorflow/python/keras/applications/resnet50.py", line 32, in ResNet50
return resnet50.ResNet50(*args, **kwargs)
File "/mnt/lustrefs/rakvee/miniconda3/envs/docker_pip2/lib/python3.6/site-packages/keras_applications/resnet50.py", line 214, in ResNet50
img_input = layers.Input(shape=input_shape)
File "/mnt/lustrefs/rakvee/miniconda3/envs/docker_pip2/lib/python3.6/site-packages/tensorflow/python/keras/engine/input_layer.py", line 229, in Input
input_tensor=tensor)
File "/mnt/lustrefs/rakvee/miniconda3/envs/docker_pip2/lib/python3.6/site-packages/tensorflow/python/keras/engine/input_layer.py", line 112, in __init__
name=self.name)
File "/mnt/lustrefs/rakvee/miniconda3/envs/docker_pip2/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 1747, in placeholder
return gen_array_ops.placeholder(dtype=dtype, shape=shape, name=name)
File "/mnt/lustrefs/rakvee/miniconda3/envs/docker_pip2/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 5206, in placeholder
"Placeholder", dtype=dtype, shape=shape, name=name)
File "/mnt/lustrefs/rakvee/miniconda3/envs/docker_pip2/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/mnt/lustrefs/rakvee/miniconda3/envs/docker_pip2/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/mnt/lustrefs/rakvee/miniconda3/envs/docker_pip2/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
op_def=op_def)
File "/mnt/lustrefs/rakvee/miniconda3/envs/docker_pip2/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
self._traceback = tf_stack.extract_stack()
InvalidArgumentError (see above for traceback): You must feed a value for placeholder tensor 'input_1' with dtype float and shape [?,256,256,3]
[[node input_1 (defined at /mnt/lustrefs/rakvee/miniconda3/envs/docker_pip2/lib/python3.6/site-packages/keras_applications/resnet50.py:214) = Placeholder[dtype=DT_FLOAT, shape=[?,256,256,3], _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
[[{{node dense_target/_5}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2237_dense_target", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[52342,1],0]
Exit code: 1
--------------------------------------------------------------------------
I've also tried tensorflow's https://www.tensorflow.org/api_docs/python/tf/metrics/auc like so:
def tf_auc_roc(y_true, y_pred):
    auc = tf.metrics.auc(y_true, y_pred)[1]
    K.get_session().run(tf.local_variables_initializer())
    return auc
It works just fine. However, it gives me a single number for AucRoc. What does that number represent? Is it the average AucRoc over all 14 classes, the maximum AUC score across the classes, or something else?
1216/1216 [==============================] - 413s 340ms/step - loss: 0.1513 - tf_auc_roc: 0.7944 - val_loss: 0.2212 - val_tf_auc_roc: 0.8074
Epoch 2/15
582/1216 [=============>................] - ETA: 3:16 - loss: 0.1459 - tf_auc_roc: 0.8053
1) How do I fix the error with roc_auc_score?
2) What does that single number represent?
I think the result of a metric has to be a single tensor value that represents the average of the results, as described here in the Keras documentation (which I find to be better documentation than TensorFlow's).
Instead, you could use a custom callback to achieve the desired result; most probably you would want to write the result to disk in on_epoch_end.
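To make the "single number" question concrete, here is a small pure-Python sketch of two common ways a multilabel AUC collapses into one value: averaging the per-class AUCs (macro) or pooling every (label, score) pair before computing one AUC (micro). Neither TensorFlow nor scikit-learn is involved; pairwise_auc and the toy data are made up for illustration. Which of the two, if either, tf.metrics.auc corresponds to is an assumption to verify; I would guess pooling, since it receives the whole 2-D tensors. Inside an on_epoch_end callback, the same per-class loop could call roc_auc_score on each column of the validation predictions.

```python
def pairwise_auc(y_true, y_score):
    """Exact ROC AUC: the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined unless both classes are present")
    # A tie between a positive and a negative score counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy multilabel data: 4 samples, 2 classes (rows are samples).
y_true = [[1, 0], [0, 1], [1, 0], [0, 1]]
y_pred = [[0.9, 0.4], [0.8, 0.6], [0.7, 0.3], [0.2, 0.5]]
num_classes = len(y_true[0])

# One AUC per class, computed column by column.
per_class = [
    pairwise_auc([row[c] for row in y_true], [row[c] for row in y_pred])
    for c in range(num_classes)
]
macro = sum(per_class) / num_classes

# One AUC over all entries pooled together.
pooled = pairwise_auc(
    [t for row in y_true for t in row],
    [s for row in y_pred for s in row],
)

print("per class:", per_class)
print("macro average:", macro)
print("pooled:", pooled)
```

The two summaries differ on this toy data, which is why per-class values are worth logging separately if you care about individual labels.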

how to see tensor value of a layer output in keras

I have a Seq2Seq model. I would like to print out the values of the encoder's output matrix at each iteration.
For example, if the encoder output has dimension (?,20), there are 5 epochs, and each epoch has 10 iterations,
I would like to see 10 matrices of dimension (?,20) per epoch.
I have followed several links, such as the one here, but the values of the matrix still do not get printed.
With this code, as mentioned in the link above:
import keras.backend as K
k_value = K.print_tensor(encoded)
print(k_value)
I got:
Tensor("Print:0", shape=(?, 20), dtype=float32)
Is there any straightforward way of showing the tensor value of each layer in Keras?
Update 1
Trying this code: K_value = K.eval(encoded) raises this error:
Traceback (most recent call last):
File "/home/sgnbx/anaconda3/envs/py3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1278, in _do_call
return fn(*args)
File "/home/sgnbx/anaconda3/envs/py3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1263, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/sgnbx/anaconda3/envs/py3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1350, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: You must feed a value for placeholder tensor 'input' with dtype float and shape [?,45,50]
[[Node: input = Placeholder[dtype=DT_FLOAT, shape=[?,45,50], _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
[[Node: encoder_lstm/add_16/_25 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_460_encoder_lstm/add_16", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/sgnbx/Downloads/projects/LSTM_autoencoder/justfun.py", line 121, in <module>
k_value = K.eval(encoded)
File "/home/sgnbx/anaconda3/envs/py3/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 671, in eval
return to_dense(x).eval(session=get_session())
File "/home/sgnbx/anaconda3/envs/py3/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 680, in eval
return _eval_using_default_session(self, feed_dict, self.graph, session)
File "/home/sgnbx/anaconda3/envs/py3/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 4951, in _eval_using_default_session
return session.run(tensors, feed_dict)
File "/home/sgnbx/anaconda3/envs/py3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 877, in run
run_metadata_ptr)
File "/home/sgnbx/anaconda3/envs/py3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1100, in _run
feed_dict_tensor, options, run_metadata)
File "/home/sgnbx/anaconda3/envs/py3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1272, in _do_run
run_metadata)
File "/home/sgnbx/anaconda3/envs/py3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1291, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: You must feed a value for placeholder tensor 'input' with dtype float and shape [?,45,50]
[[Node: input = Placeholder[dtype=DT_FLOAT, shape=[?,45,50], _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
[[Node: encoder_lstm/add_16/_25 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_460_encoder_lstm/add_16", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Caused by op 'input', defined at:
File "/home/sgnbx/Downloads/projects/LSTM_autoencoder/justfun.py", line 113, in <module>
inputs = Input(shape=(SEQUENCE_LEN, EMBED_SIZE), name="input")
File "/home/sgnbx/anaconda3/envs/py3/lib/python3.5/site-packages/keras/engine/input_layer.py", line 177, in Input
input_tensor=tensor)
File "/home/sgnbx/anaconda3/envs/py3/lib/python3.5/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "/home/sgnbx/anaconda3/envs/py3/lib/python3.5/site-packages/keras/engine/input_layer.py", line 86, in __init__
name=self.name)
File "/home/sgnbx/anaconda3/envs/py3/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 515, in placeholder
x = tf.placeholder(dtype, shape=shape, name=name)
File "/home/sgnbx/anaconda3/envs/py3/lib/python3.5/site-packages/tensorflow/python/ops/array_ops.py", line 1735, in placeholder
return gen_array_ops.placeholder(dtype=dtype, shape=shape, name=name)
File "/home/sgnbx/anaconda3/envs/py3/lib/python3.5/site-packages/tensorflow/python/ops/gen_array_ops.py", line 4925, in placeholder
"Placeholder", dtype=dtype, shape=shape, name=name)
File "/home/sgnbx/anaconda3/envs/py3/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/sgnbx/anaconda3/envs/py3/lib/python3.5/site-packages/tensorflow/python/util/deprecation.py", line 454, in new_func
return func(*args, **kwargs)
File "/home/sgnbx/anaconda3/envs/py3/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3155, in create_op
op_def=op_def)
File "/home/sgnbx/anaconda3/envs/py3/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1717, in __init__
self._traceback = tf_stack.extract_stack()
InvalidArgumentError (see above for traceback): You must feed a value for placeholder tensor 'input' with dtype float and shape [?,45,50]
[[Node: input = Placeholder[dtype=DT_FLOAT, shape=[?,45,50], _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
[[Node: encoder_lstm/add_16/_25 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_460_encoder_lstm/add_16", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Exception ignored in: <bound method BaseSession.__del__ of <tensorflow.python.client.session.Session object at 0x7fd900525c50>>
Traceback (most recent call last):
File "/home/sgnbx/anaconda3/envs/py3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 686, in __del__
TypeError: 'NoneType' object is not callable
Process finished with exit code 1
A very simple way to print a tensor:
from keras import backend as K
k_value = K.eval(tensor)
print(k_value)
UPDATE 1
Create a callback that prints the layer output at the end of each epoch:
from keras import backend as K
from keras.callbacks import Callback
class callback(Callback):
    def __init__(self, model, X_train):
        super(callback, self).__init__()
        self.model = model
        self.x = X_train
    def on_train_begin(self, logs=None):
        return
    def on_train_end(self, logs=None):
        return
    def on_epoch_begin(self, epoch, logs=None):
        return
    def on_epoch_end(self, epoch, logs=None):
        inp = self.model.input  # input placeholder
        outputs = self.model.layers[N].output  # output of layer N; set N to the index of the layer to inspect
        functors = K.function([inp, K.learning_phase()], [outputs])
        layer_outs = functors([self.x, 1.])  # 1. selects the training phase
        print('\r OUTPUT TENSOR : %s' % layer_outs)
        return
    def on_batch_begin(self, batch, logs=None):
        return
    def on_batch_end(self, batch, logs=None):
        return
Pass it to fit() through the callbacks argument:
callbacks=[callback(model=model, X_train=X_train)]
Inspired by "Keras, How to get the output of each layer?"
Hope this finally helps you!
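Another option, sketched here under some assumptions, is to build a second model that shares the encoder layers and call predict on it. This sidesteps the "You must feed a value for placeholder" error from K.eval, because predict supplies the input data itself. The input shape (45, 50), the 20 units and the layer names come from the error message and the question; the random batch, the name encoder, and the use of tensorflow.keras instead of standalone keras are assumptions made for this illustration.

```python
import numpy as np
from tensorflow import keras

# Stand-in for the encoder part of the Seq2Seq model in the question.
inputs = keras.layers.Input(shape=(45, 50), name="input")
encoded = keras.layers.LSTM(20, name="encoder_lstm")(inputs)

# A model whose output is the encoder tensor. It reuses the same layer
# objects, so it always sees the current weights.
encoder = keras.Model(inputs, encoded)

# Hypothetical batch of 4 sequences; in practice, pass real input data.
batch = np.random.random((4, 45, 50)).astype("float32")
encoder_values = encoder.predict(batch, verbose=0)

print(encoder_values.shape)
print(encoder_values)
```

Calling encoder.predict from inside a callback such as the one above (for example in on_batch_end) would print the values once per iteration rather than once per epoch.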

Using keras with tensorflow "You must feed a value for placeholder tensor 'input_1' with dtype float"

I get the unexpected error "You must feed a value for placeholder tensor 'input_1' with dtype float" when training the discriminator of a GAN.
Here is the error:
W tensorflow/core/framework/op_kernel.cc:975] Invalid argument: You must feed a value for placeholder tensor 'input_1' with dtype float
[[Node: input_1 = Placeholder[dtype=DT_FLOAT, shape=[], _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
W tensorflow/core/framework/op_kernel.cc:975] Invalid argument: You must feed a value for placeholder tensor 'input_1' with dtype float
[[Node: input_1 = Placeholder[dtype=DT_FLOAT, shape=[], _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
Traceback (most recent call last):
File "new_model.py", line 204, in <module>
main()
File "new_model.py", line 201, in main
train(nb_epoch=10, BATCH_SIZE=5)
File "new_model.py", line 176, in train
d_loss = discriminator.train_on_batch(image_to_dis, label_to_dis)
File "/usr/local/lib/python2.7/dist-packages/keras/models.py", line 766, in train_on_batch
class_weight=class_weight)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1320, in train_on_batch
outputs = self.train_function(ins)
File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 1943, in __call__
feed_dict=feed_dict)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 766, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 964, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1014, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1034, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: You must feed a value for placeholder tensor 'input_1' with dtype float
[[Node: input_1 = Placeholder[dtype=DT_FLOAT, shape=[], _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
[[Node: moments_4/sufficient_statistics/Shape/_217 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_1267_moments_4/sufficient_statistics/Shape", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Caused by op u'input_1', defined at:
File "new_model.py", line 204, in <module>
main()
File "new_model.py", line 201, in main
train(nb_epoch=10, BATCH_SIZE=5)
File "new_model.py", line 134, in train
transformer0 = transform_model()
File "new_model.py", line 22, in transform_model
inputs = Input(shape=( 128, 128, 3))
File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 1198, in Input
input_tensor=tensor)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 1116, in __init__
name=self.name)
File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 321, in placeholder
x = tf.placeholder(dtype, shape=shape, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/array_ops.py", line 1587, in placeholder
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 2043, in _placeholder
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 759, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2240, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1128, in __init__
self._traceback = _extract_stack()
InvalidArgumentError (see above for traceback): You must feed a value for placeholder tensor 'input_1' with dtype float
[[Node: input_1 = Placeholder[dtype=DT_FLOAT, shape=[], _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
[[Node: moments_4/sufficient_statistics/Shape/_217 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_1267_moments_4/sufficient_statistics/Shape", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
It seems the error happens at
d_loss = discriminator.train_on_batch(image_to_dis, label_to_dis)
I'm sure image_to_dis and label_to_dis fit the input of the discriminator.
However, this part of the error message
Caused by op u'input_1', defined at:
File "new_model.py", line 204, in <module>
main()
File "new_model.py", line 201, in main
train(nb_epoch=10, BATCH_SIZE=5)
File "new_model.py", line 134, in train
transformer0 = transform_model()
File "new_model.py", line 22, in transform_model
inputs = Input(shape=( 128, 128, 3))
says the error is caused by the input tensor of 'transformer' (the generator in this GAN).
My code contains something like 'transformer_with_discriminator = discriminator(transformer)', but the discriminator is compiled without the transformer, so I think training the discriminator should have nothing to do with the input of 'transformer0'.
The whole script is a little long, so here is a link to my model:
https://github.com/wkcw/keras-face-attribute/blob/master/model%26train.py
image_to_dis.dtype and label_to_dis.dtype are both float32, and I've also tried converting label_to_dis to int.
I really have no idea what is going on.
It comes from BatchNormalization. See https://stackoverflow.com/a/42470757/7137636 for how to fix this issue.
If you need more info, ask in the comments :)

How to transform tensors to numpy.ndarray in Keras?

I have a TensorFlow Tensor within Keras (<class 'tensorflow.python.framework.ops.Tensor'>) which needs to be transformed into a numpy.ndarray.
Is there an easy solution to this task?
I did time-consuming research via Google and trial and error.
It seems that I have to use TensorFlow sessions, but I don't want to use any TensorFlow sessions, because when I do I get errors like:
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: TITAN X (Pascal), pci bus id: 0000:09:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:1) -> (device: 1, name: TITAN X (Pascal), pci bus id: 0000:05:00.0)
Traceback (most recent call last):
File "mnist_cnn_model_api.py", line 348, in <module>
validation_data=(X_test, Y_test))
File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1596, in fit_generator
callbacks.on_epoch_end(epoch, epoch_logs)
File "/usr/local/lib/python2.7/dist-packages/keras/callbacks.py", line 76, in on_epoch_end
callback.on_epoch_end(epoch, logs)
File "/home/dominik/keras/mnist/my_functions.py", line 737, in on_epoch_end
current_ist = current_is.eval()
File "/home/dominik/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 575, in eval
return _eval_using_default_session(self, feed_dict, self.graph, session)
File "/home/dominik/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3633, in _eval_using_default_session
return session.run(tensors, feed_dict)
File "/home/dominik/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 766, in run
run_metadata_ptr)
File "/home/dominik/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 964, in _run
feed_dict_string, options, run_metadata)
File "/home/dominik/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1014, in _do_run
target_list, options, run_metadata)
File "/home/dominik/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1034, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: You must feed a value for placeholder tensor 'input_1' with dtype float
[[Node: input_1 = Placeholder[dtype=DT_FLOAT, shape=[], _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
[[Node: strided_slice_1/_1 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_6_strided_slice_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Caused by op u'input_1', defined at:
File "mnist_cnn_model_api.py", line 113, in <module>
inputs = Input(shape=input_shape)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 1198, in Input
input_tensor=tensor)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 1116, in __init__
name=self.name)
File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 309, in placeholder
x = tf.placeholder(dtype, shape=shape, name=name)
File "/home/dominik/.local/lib/python2.7/site-packages/tensorflow/python/ops/array_ops.py", line 1587, in placeholder
name=name)
File "/home/dominik/.local/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 2043, in _placeholder
name=name)
File "/home/dominik/.local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 759, in apply_op
op_def=op_def)
File "/home/dominik/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2240, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/home/dominik/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1128, in __init__
self._traceback = _extract_stack()
InvalidArgumentError (see above for traceback): You must feed a value for placeholder tensor 'input_1' with dtype float
[[Node: input_1 = Placeholder[dtype=DT_FLOAT, shape=[], _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
[[Node: strided_slice_1/_1 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_6_strided_slice_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
I need this placeholder in line 113. It causes no trouble when using Keras without any TensorFlow session, but as soon as I try to use a TensorFlow session (in order to call .eval() and turn the tensor into a numpy.ndarray), my code explodes...
I appreciate your help!

OOM Error running resnet model tensorflow

I am running the ResNet model on an EC2 g2 (NVIDIA GRID K520) instance and seeing an OOM error. I have tried various combinations of removing the code that uses a GPU, prefixing the command with CUDA_VISIBLE_DEVICES='0', and reducing the batch_size to 64. Training still fails to start. Can you help?
W tensorflow/core/common_runtime/bfc_allocator.cc:270] **********************x***************************************************************************xx
W tensorflow/core/common_runtime/bfc_allocator.cc:271] Ran out of memory trying to allocate 196.00MiB. See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:936] Resource exhausted: OOM when allocating tensor with shape[64,16,224,224]
E tensorflow/core/client/tensor_c_api.cc:485] OOM when allocating tensor with shape[64,16,224,224]
[[Node: unit_1_2/sub1/conv1/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](unit_1_2/residual_only_activation/leaky_relu, unit_1_2/sub1/conv1/DW/read)]]
[[Node: train_step/update/_1561 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_10115_train_step/update", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Traceback (most recent call last):
File "./resnet_main.py", line 203, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv))
File "./resnet_main.py", line 197, in main
train(hps)
File "./resnet_main.py", line 82, in train
feed_dict={model.lrn_rate: lrn_rate})
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 382, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 655, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 723, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 743, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.ResourceExhaustedError: OOM when allocating tensor with shape[64,16,224,224]
[[Node: unit_1_2/sub1/conv1/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](unit_1_2/residual_only_activation/leaky_relu, unit_1_2/sub1/conv1/DW/read)]]
[[Node: train_step/update/_1561 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_10115_train_step/update", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Caused by op u'unit_1_2/sub1/conv1/Conv2D', defined at:
File "./resnet_main.py", line 203, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv))
File "./resnet_main.py", line 197, in main
train(hps)
File "./resnet_main.py", line 64, in train
model.build_graph()
File "/home/ubuntu/indu/tf-benchmark/resnet/resnet_model.py", line 59, in build_graph
self._build_model()
File "/home/ubuntu/indu/tf-benchmark/resnet/resnet_model.py", line 94, in _build_model
x = res_func(x, filters[1], filters[1], self._stride_arr(1), False)
File "/home/ubuntu/indu/tf-benchmark/resnet/resnet_model.py", line 208, in _residual
x = self._conv('conv1', x, 3, in_filter, out_filter, stride)
File "/home/ubuntu/indu/tf-benchmark/resnet/resnet_model.py", line 279, in _conv
return tf.nn.conv2d(x, kernel, strides, padding='SAME')
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 394, in conv2d
data_format=data_format, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 703, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2310, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1232, in __init__
self._traceback = _extract_stack()
The NVIDIA GRID K520 has 8GB of memory in total, split across its two GPUs (4GB each). I have successfully trained ResNet models on an NVIDIA GPU with 12GB of memory. As the error suggests, TensorFlow attempts to fit the network's weights and activations in GPU memory and fails. I believe you have a few options:
Train only on the CPU, as mentioned in the comments, assuming your CPU has more than 8GB of memory. This is not recommended.
Train a different network with fewer parameters. Several networks have been released since ResNet, such as Inception-v4 and Inception-ResNet, with fewer parameters and comparable accuracy. This option costs nothing to try!
Buy a GPU with more memory. Easiest option if you have the money.
Buy another GPU with the same memory and train the bottom half of the network on one, and the top half of the network on the other. The difficulty in communicating between the GPUs makes this option less desirable.
I hope this helps you and others that run into similar memory issues.
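As a rough sanity check on the numbers in the log, here is a small sketch (not taken from the original resnet_main.py) that computes how much memory a single float32 tensor of the reported shape needs. The helper name tensor_mib and the smaller batch sizes are only illustrative assumptions. The shape [64,16,224,224] comes from the OOM message in the question.

```python
from functools import reduce
from operator import mul

BYTES_PER_FLOAT32 = 4  # DT_FLOAT in the log is a 32-bit float


def tensor_mib(shape, bytes_per_element=BYTES_PER_FLOAT32):
    """Return the size in MiB of one dense tensor with the given shape."""
    num_elements = reduce(mul, shape, 1)
    return num_elements * bytes_per_element / (1024.0 ** 2)


# Shape copied from the OOM message in the question (batch size 64).
oom_shape = [64, 16, 224, 224]
print("batch 64: %.2f MiB" % tensor_mib(oom_shape))

# Hypothetical smaller batch sizes: the first dimension scales the size linearly.
for batch_size in (32, 16, 8):
    shape = [batch_size] + oom_shape[1:]
    print("batch %d: %.2f MiB" % (batch_size, tensor_mib(shape)))
```

For batch size 64 this gives 196 MiB, which matches the "Ran out of memory trying to allocate 196.00MiB" line in the log. Keep in mind this is only one tensor: training has to keep many such activations alive for backpropagation, so the total footprint is far larger, and shrinking the batch size (or the input resolution) reduces it roughly in proportion.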