"Create kernel failed" in Tensorflow when mapping nodes to different devices

"Create kernel failed" in Tensorflow when mapping nodes to different devices - tensorflow

I have done some manual graph partitioning in tensorFlow using a simple hash function to map the nodes on different (2) CPU devices.
When I map the whole graph on the first device or the second it works. (That is why I don't understand the error message "Create kernel failed").
Do you have any idea what is wrong?
However the following error occurs:
E tensorflow/core/framework/op_segment.cc:53] Create kernel failed: Invalid argument: AttrValue must not have reference type value of float_ref
for attr 'tensor_type'
; NodeDef: Variable/_9 = _Recv[_start_time=0, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/cpu:1", send_device_incarnation=1, tensor_name="edge_10_Variable", tensor_type=DT_FLOAT_REF, _device="/job:localhost/replica:0/task:0/cpu:0"](^zeros/_11); Op<name=_Recv; signature= -> tensor:tensor_type; attr=tensor_type:type; attr=tensor_name:string; attr=send_device:string; attr=send_device_incarnation:int; attr=recv_device:string; attr=client_terminated:bool,default=false; is_stateful=true>
E tensorflow/core/common_runtime/executor.cc:334] Executor failed to create kernel. Invalid argument: AttrValue must not have reference type value of float_ref
for attr 'tensor_type'
; NodeDef: Variable/_9 = _Recv[_start_time=0, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/cpu:1", send_device_incarnation=1, tensor_name="edge_10_Variable", tensor_type=DT_FLOAT_REF, _device="/job:localhost/replica:0/task:0/cpu:0"](^zeros/_11); Op<name=_Recv; signature= -> tensor:tensor_type; attr=tensor_type:type; attr=tensor_name:string; attr=send_device:string; attr=send_device_incarnation:int; attr=recv_device:string; attr=client_terminated:bool,default=false; is_stateful=true>
[[Node: Variable/_9 = _Recv[_start_time=0, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/cpu:1", send_device_incarnation=1, tensor_name="edge_10_Variable", tensor_type=DT_FLOAT_REF, _device="/job:localhost/replica:0/task:0/cpu:0"](^zeros/_11)]]
Traceback (most recent call last):
File "/Users/larissa/Desktop/GraphPartSched/Theorie/TensorFlow/TensorFlow_Tutorials/MNIST_For_ML_Beginners.py", line 53, in <module>
sess.run(init)
File "/Users/larissa/tensorflowSource/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 717, in run
run_metadata_ptr)
File "/Users/larissa/tensorflowSource/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 915, in _run
feed_dict_string, options, run_metadata)
File "/Users/larissa/tensorflowSource/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 965, in _do_run
target_list, options, run_metadata)
File "/Users/larissa/tensorflowSource/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 985, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.InvalidArgumentError: AttrValue must not have reference type value of float_ref
for attr 'tensor_type'
; NodeDef: Variable/_9 = _Recv[_start_time=0, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/cpu:1", send_device_incarnation=1, tensor_name="edge_10_Variable", tensor_type=DT_FLOAT_REF, _device="/job:localhost/replica:0/task:0/cpu:0"](^zeros/_11); Op<name=_Recv; signature= -> tensor:tensor_type; attr=tensor_type:type; attr=tensor_name:string; attr=send_device:string; attr=send_device_incarnation:int; attr=recv_device:string; attr=client_terminated:bool,default=false; is_stateful=true>
[[Node: Variable/_9 = _Recv[_start_time=0, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/cpu:1", send_device_incarnation=1, tensor_name="edge_10_Variable", tensor_type=DT_FLOAT_REF, _device="/job:localhost/replica:0/task:0/cpu:0"](^zeros/_11)]]

Related

TensorFlow tfrecord edit (read and then write to file) OutOfRange error

In src/datasets/h36m_edit.py:
with tf.Session() as sess:
reader = tf.TFRecordReader()
coder = ImageCoder()
fqueue = tf.train.string_input_producer(files, num_epochs=1, shuffle=False, name="input")
_, example_serialized = reader.read(fqueue)
sess.run(tf.local_variables_initializer())
sess.run(tf.global_variables_initializer())
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(coord=coord)
fidx = 0
total_imgs = 0
image, image_size, label, center, fname, pose, shape, gt3d, has_smpl3d = parse_example_proto(example_serialized)
while not coord.should_stop():
fidx += 1
tf_filename = out_path% fidx
print('Starting tfrecord file %s \n' % tf_filename)
with tf.python_io.TFRecordWriter(tf_filename) as writer:
for i in tqdm(range(train_shards)): # min(train_shards, image_bs.shape[0])
image_v, image_size_v, label_v, center_v, fname_v, pose_v, shape_v, gt3d_v, has_smpl3d_v = sess.run(
[image, image_size, label, center, fname, pose, shape, gt3d, has_smpl3d])
image_s = coder.encode_jpeg(image_v)
example = convert_to_example_wmosh(image_s, fname_v, image_size_v[0], image_size_v[1],
label_v, center_v, gt3d_v, pose_v, shape_v)
writer.write(example.SerializeToString())
total_imgs += 1
coord.request_stop()
coord.join(threads)
Sometimes the inner loop stops before it reaches the maximum iter limit (train_shards) 500.
100%|██████████| 500/500 [00:02<00:00, 225.07it/s]
Starting tfrecord file /home/cdeng/tf_datasets/tf_records_human36m_wjoints/train_modified/train_0011.tfrecord
96%|█████████▌| 478/500 [00:02<00:00, 225.58it/s]Starting tfrecord file /home/cdeng/tf_datasets/tf_records_human36m_wjoints/train_modified/train_0012.tfrecord
100%|██████████| 500/500 [00:02<00:00, 230.37it/s]
And when it writes to the number 625 tfrecord file, there is OutOfRange error (it supposes to finish with more than 3000 tfrecord files, cause human36m train has 1559985 images and each tfrecord contains 500 images). I guess it's because the image queue is not handled correctly, maybe the producer is too slow?
/home/cdeng/tf_datasets/tf_records_human36m_wjoints/train_modified/train_0625.tfrecord
36%|███▌ | 180/500 [00:00<00:01, 221.50it/s]2019-01-13 22:47:40.946736: W tensorflow/core/framework/op_kernel.cc:1192] Out of range: FIFOQueue '_0_input' is closed and has insufficient elements (requested 1, current size 0)
[[Node: ReaderReadV2 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/cpu:0"](TFRecordReaderV2, input)]]
2019-01-13 22:47:40.946816: W tensorflow/core/framework/op_kernel.cc:1192] Out of range: FIFOQueue '_0_input' is closed and has insufficient elements (requested 1, current size 0)
[[Node: ReaderReadV2 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/cpu:0"](TFRecordReaderV2, input)]]
Traceback (most recent call last):
File "/home/cdeng/star_repos/hmr/src/datasets/h36m_edit.py", line 233, in <module>
[image, image_size, label, center, fname, pose, shape, gt3d, has_smpl3d])
File "/home/cdeng/.virtualenvs/hmr/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 895, in run
run_metadata_ptr)
File "/home/cdeng/.virtualenvs/hmr/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1124, in _run
feed_dict_tensor, options, run_metadata)
File "/home/cdeng/.virtualenvs/hmr/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
options, run_metadata)
File "/home/cdeng/.virtualenvs/hmr/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.OutOfRangeError: FIFOQueue '_0_input' is closed and has insufficient elements (requested 1, current size 0)
[[Node: ReaderReadV2 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/cpu:0"](TFRecordReaderV2, input)]]
[[Node: ParseSingleExample/ParseExample/ParseExample/_21 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_52_ParseSingleExample/ParseExample/ParseExample", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
Caused by op u'ReaderReadV2', defined at:
File "/home/cdeng/star_repos/hmr/src/datasets/h36m_edit.py", line 204, in <module>
_, example_serialized = reader.read(fqueue)
File "/home/cdeng/.virtualenvs/hmr/local/lib/python2.7/site-packages/tensorflow/python/ops/io_ops.py", line 194, in read
return gen_io_ops._reader_read_v2(self._reader_ref, queue_ref, name=name)
File "/home/cdeng/.virtualenvs/hmr/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 423, in _reader_read_v2
queue_handle=queue_handle, name=name)
File "/home/cdeng/.virtualenvs/hmr/local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/home/cdeng/.virtualenvs/hmr/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/home/cdeng/.virtualenvs/hmr/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1204, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
OutOfRangeError (see above for traceback): FIFOQueue '_0_input' is closed and has insufficient elements (requested 1, current size 0)
[[Node: ReaderReadV2 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/cpu:0"](TFRecordReaderV2, input)]]
[[Node: ParseSingleExample/ParseExample/ParseExample/_21 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_52_ParseSingleExample/ParseExample/ParseExample", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
Process finished with exit code 1

Problem solved.
Two comments:
tqdm could be wrong when the loop is running too fast.
when the queue is empty at the end, OutOfRange error will be thrown out, a good practice is to add exception handling as suggested in QueueRunner

I run the code in distributed mode and my code run good in asynchronous mode; but the code run unsuccessful in synchronous mode

I run the code in distributed mode and my code run good in asynchronous mode; but the code run unsuccessful in synchronous mode.
opt = tf.train.MomentumOptimizer(learning_rate=lr_placeholder, momentum=0.9) opt=tf.train.SyncReplicasOptimizer(opt,replicas_to_aggregate=len(worker_hosts),total_num_replicas=len(worker_hosts),use_locking=True)
train_op = opt.minimize(full_loss, global_step=global_step)
val_op = validation_op(validation_step, vali_top1_error, vali_loss)
sync_replicas_hook = opt.make_session_run_hook(True)
init=tf.global_variables_initializer()
with training.MonitoredTrainingSession(master=server.target, is_chief=True,hooks=[sync_replicas_hook]) as sess:
Traceback (most recent call last):
File "/usr/local/python3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1
292, in _do_call return fn(*args)
File "/usr/local/python3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1
277, in _run_fn options, feed_dict, fetch_list, target_list, run_metadata)
File "/usr/local/python3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1
367, in _call_tf_sessionrun run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: NodeDef missing attr 'reduction_type'
from Op handle:Ref(string); attr=dtype:type,allowed=[DT_FLOAT, DT_DOUBLE, DT_INT32, DT_UINT8, DT_INT16, ..., DT_UINT16, DT_COMPLEX128, DT_HALF, DT_UINT32, DT_UINT64]; attr=shape:shape; attr=container:string,default=""; attr=shared_name:string,default=""; attr=reduction_type:string,default="MEAN",allowed=["MEAN", "SUM"]; is_stateful=true>; NodeDef: {{node sync_replicas/conditional_accumulator}} = ConditionalAccumulator_class=["loc:#sync_replicas/SetGlobalStep"], container="", dtype=DT_FLOAT, shape=[3,3,3,16], shared_name="conv0/conv:0/grad_accum", _device="/job:ps/replica:0/task:0/device:CPU:0"
During handling of the above exception, another exception occurred:

tensorflow.python.framework.errors_impl.InvalidArgumentError: NodeDef missing attr 'reduction_type'
from Op handle:Ref(string); attr=dtype:type,allowed=[DT_FLOAT, DT_DOUBLE, DT_INT32, DT_UINT8, DT_INT16, ..., DT_UINT16, DT_COMPLEX128, DT_HALF, DT_UINT32, DT_UINT64]; attr=shape:shape; attr=container:string,default=""; attr=shared_name:string,default=""; attr=reduction_type:string,default="MEAN",allowed=["MEAN", "SUM"]; is_stateful=true>; NodeDef: {{node sync_replicas/conditional_accumulator}} = ConditionalAccumulator_class=["loc:#sync_replicas/SetGlobalStep"], container="", dtype=DT_FLOAT, shape=[3,3,3,16], shared_name="conv0/conv:0/grad_accum", _device="/job:ps/replica:0/task:0/device:CPU:0"

assertion failed error when using tensorflow object detection API to fine tune the mask_rcnn_inception_resnet_v2_atrous_coco model

I was trying to use tensorflow object detection API to fine tune the mask_rcnn_inception_resnet_v2_atrous_coco model and use it to train on the MIO-TCD dataset. I converted the MIO-TCD dataset into TFRecord.
However, I got stuck with the following InvalidArgumentError:
INFO:tensorflow:Error reported to Coordinator: assertion failed: [] [Condition x == y did not hold element-wise:] [x (Loss/BoxClassifierLoss/assert_equal_2/x:0) = ] [0] [y (Loss/BoxClassifierLoss/assert_equal_2/y:0) = ] [5]
[[Node: Loss/BoxClassifierLoss/assert_equal_2/Assert/Assert = Assert[T=[DT_STRING, DT_STRING, DT_STRING, DT_INT32, DT_STRING, DT_INT32], summarize=3, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Loss/BoxClassifierLoss/assert_equal_2/All/_155, Loss/RPNLoss/assert_equal/Assert/Assert/data_0, Loss/RPNLoss/assert_equal/Assert/Assert/data_1, Loss/BoxClassifierLoss/assert_equal_2/Assert/Assert/data_2, Loss/BoxClassifierLoss/assert_equal_2/x/_157, Loss/BoxClassifierLoss/assert_equal_2/Assert/Assert/data_4, Loss/RPNLoss/ones_1/shape/_147)]]
[[Node: FirstStageFeatureExtractor/InceptionResnetV2/Mixed_5b/Branch_2/Conv2d_0a_1x1/BatchNorm/moving_mean/read/_225 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_2248_FirstStageFeatureExtractor/InceptionResnetV2/Mixed_5b/Branch_2/Conv2d_0a_1x1/BatchNorm/moving_mean/read", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
Caused by op 'Loss/BoxClassifierLoss/assert_equal_2/Assert/Assert', defined at:
File "train.py", line 167, in <module>
tf.app.run()
File "C:\Users\hedey\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\platform\app.py", line 124, in run
_sys.exit(main(argv))
File "train.py", line 163, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "C:\Users\hedey\models\research\object_detection\trainer.py", line 246, in train
clones = model_deploy.create_clones(deploy_config, model_fn, [input_queue])
File "C:\Users\hedey\models\research\deployment\model_deploy.py", line 193, in create_clones
outputs = model_fn(*args, **kwargs)
File "C:\Users\hedey\models\research\object_detection\trainer.py", line 181, in _create_losses
losses_dict = detection_model.loss(prediction_dict, true_image_shapes)
File "C:\Users\hedey\models\research\object_detection\meta_architectures\faster_rcnn_meta_arch.py", line 1580, in loss
groundtruth_masks_list,
File "C:\Users\hedey\models\research\object_detection\meta_architectures\faster_rcnn_meta_arch.py", line 1813, in _loss_box_classifier
groundtruth_boxlists, groundtruth_masks_list)
File "C:\Users\hedey\models\research\object_detection\core\target_assigner.py", line 447, in batch_assign_targets
anchors, gt_boxes, gt_class_targets, gt_weights)
File "C:\Users\hedey\models\research\object_detection\core\target_assigner.py", line 151, in assign
groundtruth_boxes.get())[:1])
File "C:\Users\hedey\models\research\object_detection\utils\shape_utils.py", line 279, in assert_shape_equal
return tf.assert_equal(shape_a, shape_b)
File "C:\Users\hedey\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\ops\check_ops.py", line 392, in assert_equal
return control_flow_ops.Assert(condition, data, summarize=summarize)
File "C:\Users\hedey\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\util\tf_should_use.py", line 118, in wrapped
return _add_should_use_warning(fn(*args, **kwargs))
File "C:\Users\hedey\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\ops\control_flow_ops.py", line 169, in Assert
condition, data, summarize, name="Assert")
File "C:\Users\hedey\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\ops\gen_logging_ops.py", line 48, in _assert
name=name)
File "C:\Users\hedey\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "C:\Users\hedey\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\framework\ops.py", line 3160, in create_op
op_def=op_def)
File "C:\Users\hedey\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\framework\ops.py", line 1625, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InvalidArgumentError (see above for traceback): assertion failed: [] [Condition x == y did not hold element-wise:] [x (Loss/BoxClassifierLoss/assert_equal_2/x:0) = ] [0] [y (Loss/BoxClassifierLoss/assert_equal_2/y:0) = ] [5]
[[Node: Loss/BoxClassifierLoss/assert_equal_2/Assert/Assert = Assert[T=[DT_STRING, DT_STRING, DT_STRING, DT_INT32, DT_STRING, DT_INT32], summarize=3, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Loss/BoxClassifierLoss/assert_equal_2/All/_155, Loss/RPNLoss/assert_equal/Assert/Assert/data_0, Loss/RPNLoss/assert_equal/Assert/Assert/data_1, Loss/BoxClassifierLoss/assert_equal_2/Assert/Assert/data_2, Loss/BoxClassifierLoss/assert_equal_2/x/_157, Loss/BoxClassifierLoss/assert_equal_2/Assert/Assert/data_4, Loss/RPNLoss/ones_1/shape/_147)]]
[[Node: FirstStageFeatureExtractor/InceptionResnetV2/Mixed_5b/Branch_2/Conv2d_0a_1x1/BatchNorm/moving_mean/read/_225 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_2248_FirstStageFeatureExtractor/InceptionResnetV2/Mixed_5b/Branch_2/Conv2d_0a_1x1/BatchNorm/moving_mean/read", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
Traceback (most recent call last):
File "C:\Users\hedey\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\client\session.py", line 1350, in _do_call
return fn(*args)
File "C:\Users\hedey\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\client\session.py", line 1329, in _run_fn
status, run_metadata)
File "C:\Users\hedey\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 473, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: assertion failed: [] [Condition x == y did not hold element-wise:] [x (Loss/BoxClassifierLoss/assert_equal_2/x:0) = ] [0] [y (Loss/BoxClassifierLoss/assert_equal_2/y:0) = ] [5]
[[Node: Loss/BoxClassifierLoss/assert_equal_2/Assert/Assert = Assert[T=[DT_STRING, DT_STRING, DT_STRING, DT_INT32, DT_STRING, DT_INT32], summarize=3, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Loss/BoxClassifierLoss/assert_equal_2/All/_155, Loss/RPNLoss/assert_equal/Assert/Assert/data_0, Loss/RPNLoss/assert_equal/Assert/Assert/data_1, Loss/BoxClassifierLoss/assert_equal_2/Assert/Assert/data_2, Loss/BoxClassifierLoss/assert_equal_2/x/_157, Loss/BoxClassifierLoss/assert_equal_2/Assert/Assert/data_4, Loss/RPNLoss/ones_1/shape/_147)]]
[[Node: FirstStageFeatureExtractor/InceptionResnetV2/Mixed_5b/Branch_2/Conv2d_0a_1x1/BatchNorm/moving_mean/read/_225 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_2248_FirstStageFeatureExtractor/InceptionResnetV2/Mixed_5b/Branch_2/Conv2d_0a_1x1/BatchNorm/moving_mean/read", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
I found that other people posted about the same problem in more than one of the github issues. The following is an example, and I already commented there and was advised to post about it on stackoverflow:
https://github.com/tensorflow/models/issues/3972#issuecomment-381535604
https://github.com/tensorflow/models/issues/3972#issuecomment-381535604

When you convert the MIO-TCD dataset into TFRecord,you should set include_masks parameter like this.
--include_masks=True
You can try.

The problem is in your tfrecord file created with the create_pet_tf_record.py program, you need to create it using the following arg --faces_only set to false because if you leave it to True (default value), then no segmentation will be supplied whereas this is what you are trying to train.
Check this : https://github.com/tensorflow/models/issues/3972

InvalidArgumentError: ConcatOp : Dimensions of inputs should match: shape[0] = [1,246,381,3] vs. shape[1] = [1,252,367,3]

This is my code snippet, how I am doing conatenate all train images (left and right and mask seperatly). In the the variables l, r tensores with the shape of [4, ?, ?, 3] are assigned.
with tf.Session() as session:
l_train = [x.l_img for x in images][:4]
r_train = [x.r_img for x in images][:4]
m_train = [x.mask for x in images][:4]
l = tf.concat(l_train, 0)
r = tf.concat(r_train, 0)
m = tf.concat(m_train, 0)
l.eval()
When using eval() I got this error?
Traceback (most recent call last):
File "/home/test/anaconda2/envs/tensorflow/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2881, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-5-f78dccf94f7f>", line 1, in <module>
l.eval()
File "/home/test/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 606, in eval
return _eval_using_default_session(self, feed_dict, self.graph, session)
File "/home/test/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3928, in _eval_using_default_session
return session.run(tensors, feed_dict)
File "/home/test/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 789, in run
run_metadata_ptr)
File "/home/test/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 997, in _run
feed_dict_string, options, run_metadata)
File "/home/test/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1132, in _do_run
target_list, options, run_metadata)
File "/home/test/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1152, in _do_call
raise type(e)(node_def, op, message)
InvalidArgumentError: ConcatOp : Dimensions of inputs should match: shape[0] = [1,246,381,3] vs. shape[1] = [1,252,367,3]
[[Node: concat = ConcatV2[N=4, T=DT_FLOAT, Tidx=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](Reading/reshape_t_left/_1, Reading/reshape_t_left_1/_3, Reading/reshape_t_left_2/_5, Reading/reshape_t_left_3/_7, concat/axis)]]
[[Node: concat/_9 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_370_concat", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
How train my training set with dynamic patchsizes? Meanwhile, loop over my images and feed my CNN with one image after another.
_, summary_str, costs = sess.run([optimizer, merged_summary_op, cost_function],
feed_dict={t_im0: l.eval(), t_im1: r.eval(),
t_label: m.eval()})

I'm having exactly the same issue and I think it is because the batch size is 1 on the paper of Faster R-CNN.

Unable to freeze inception v1 (tf-slim) graph

After successfully running all examples from slim walkthrough notebook, I wanted to freeze the graph. In order to do that, I ran the following (copy from original notebook):
import os
from datasets import flowers
from nets import inception
from preprocessing import inception_preprocessing
slim = tf.contrib.slim
image_size = inception.inception_v1.default_image_size
def get_init_fn():
"""Returns a function run by the chief worker to warm-start the training."""
checkpoint_exclude_scopes=["InceptionV1/Logits", "InceptionV1/AuxLogits"]
exclusions = [scope.strip() for scope in checkpoint_exclude_scopes]
variables_to_restore = []
for var in slim.get_model_variables():
excluded = False
for exclusion in exclusions:
if var.op.name.startswith(exclusion):
excluded = True
break
if not excluded:
variables_to_restore.append(var)
return slim.assign_from_checkpoint_fn(
os.path.join(checkpoints_dir, 'inception_v1.ckpt'),
variables_to_restore)
train_dir = '/tmp/inception_finetuned/'
with tf.Graph().as_default():
tf.logging.set_verbosity(tf.logging.INFO)
dataset = flowers.get_split('train', flowers_data_dir)
images, _, labels = load_batch(dataset, height=image_size, width=image_size)
# Create the model, use the default arg scope to configure the batch norm parameters.
with slim.arg_scope(inception.inception_v1_arg_scope()):
logits, _ = inception.inception_v1(images, num_classes=dataset.num_classes, is_training=True)
# Specify the loss function:
one_hot_labels = slim.one_hot_encoding(labels, dataset.num_classes)
slim.losses.softmax_cross_entropy(logits, one_hot_labels)
total_loss = slim.losses.get_total_loss()
# Create some summaries to visualize the training process:
tf.scalar_summary('losses/Total Loss', total_loss)
# Specify the optimizer and create the train op:
optimizer = tf.train.AdamOptimizer(learning_rate=0.01)
train_op = slim.learning.create_train_op(total_loss, optimizer)
# Run the training:
final_loss = slim.learning.train(
train_op,
logdir=train_dir,
init_fn=get_init_fn(),
number_of_steps=2)
print('Finished training. Last batch loss %f' % final_loss)
The code above produced the following files in /tmp/inception_finetuned folder:
checkpoint
model.ckpt-0.meta
events.out.tfevents.1478081437.Nikos-MacBook-Pro.local
model.ckpt-2 graph.pbtxt
model.ckpt-2.meta model.ckpt-0
Then, in order to freeze the graph, I ran the following command:
bazel-bin/tensorflow/python/tools/freeze_graph --input_graph=/tmp/inception_finetuned/graph.pbtxt --input_checkpoint=/tmp/inception_finetuned/model.ckpt-2 --output_graph=/tmp/freeze.pb --output_node_names=InceptionV1/Logits/Predictions/Softmax
The command, however, produced the following error:
W tensorflow/core/framework/op_kernel.cc:968] Failed precondition: Attempting to use uninitialized value InceptionV1/Logits/Conv2d_0c_1x1/biases/Adam_1
[[Node: _send_InceptionV1/Logits/Conv2d_0c_1x1/biases/Adam_1_0 = _Send[T=DT_FLOAT, client_terminated=true, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=6007788667487390928, tensor_name="InceptionV1/Logits/Conv2d_0c_1x1/biases/Adam_1:0", _device="/job:localhost/replica:0/task:0/cpu:0"](InceptionV1/Logits/Conv2d_0c_1x1/biases/Adam_1)]]
...
W tensorflow/core/framework/op_kernel.cc:968] Failed precondition: Attempting to use uninitialized value InceptionV1/Logits/Conv2d_0c_1x1/biases/Adam_1
[[Node: _send_InceptionV1/Logits/Conv2d_0c_1x1/biases/Adam_1_0 = _Send[T=DT_FLOAT, client_terminated=true, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=6007788667487390928, tensor_name="InceptionV1/Logits/Conv2d_0c_1x1/biases/Adam_1:0", _device="/job:localhost/replica:0/task:0/cpu:0"](InceptionV1/Logits/Conv2d_0c_1x1/biases/Adam_1)]]
Traceback (most recent call last):
File "/Users/nikogamulin/workspace/tensorflow/bazel-bin/tensorflow/python/tools/freeze_graph.runfiles/org_tensorflow/tensorflow/python/tools/freeze_graph.py", line 135, in <module>
tf.app.run()
File "/Users/nikogamulin/workspace/tensorflow/bazel-bin/tensorflow/python/tools/freeze_graph.runfiles/org_tensorflow/tensorflow/python/platform/app.py", line 32, in run
sys.exit(main(sys.argv[:1] + flags_passthrough))
File "/Users/nikogamulin/workspace/tensorflow/bazel-bin/tensorflow/python/tools/freeze_graph.runfiles/org_tensorflow/tensorflow/python/tools/freeze_graph.py", line 132, in main
FLAGS.output_graph, FLAGS.clear_devices, FLAGS.initializer_nodes)
File "/Users/nikogamulin/workspace/tensorflow/bazel-bin/tensorflow/python/tools/freeze_graph.runfiles/org_tensorflow/tensorflow/python/tools/freeze_graph.py", line 121, in freeze_graph
sess, input_graph_def, output_node_names.split(","))
File "/Users/nikogamulin/workspace/tensorflow/bazel-bin/tensorflow/python/tools/freeze_graph.runfiles/org_tensorflow/tensorflow/python/framework/graph_util.py", line 226, in convert_variables_to_constants
returned_variables = sess.run(variable_names)
File "/Users/nikogamulin/workspace/tensorflow/bazel-bin/tensorflow/python/tools/freeze_graph.runfiles/org_tensorflow/tensorflow/python/client/session.py", line 717, in run
run_metadata_ptr)
File "/Users/nikogamulin/workspace/tensorflow/bazel-bin/tensorflow/python/tools/freeze_graph.runfiles/org_tensorflow/tensorflow/python/client/session.py", line 915, in _run
feed_dict_string, options, run_metadata)
File "/Users/nikogamulin/workspace/tensorflow/bazel-bin/tensorflow/python/tools/freeze_graph.runfiles/org_tensorflow/tensorflow/python/client/session.py", line 965, in _do_run
target_list, options, run_metadata)
File "/Users/nikogamulin/workspace/tensorflow/bazel-bin/tensorflow/python/tools/freeze_graph.runfiles/org_tensorflow/tensorflow/python/client/session.py", line 985, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.FailedPreconditionError: Attempting to use uninitialized value InceptionV1/Logits/Conv2d_0c_1x1/biases/Adam_1
[[Node: _send_InceptionV1/Logits/Conv2d_0c_1x1/biases/Adam_1_0 = _Send[T=DT_FLOAT, client_terminated=true, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=6007788667487390928, tensor_name="InceptionV1/Logits/Conv2d_0c_1x1/biases/Adam_1:0", _device="/job:localhost/replica:0/task:0/cpu:0"](InceptionV1/Logits/Conv2d_0c_1x1/biases/Adam_1)]]
Then I tried to use different optimizer:
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
and got the following error:
W tensorflow/core/framework/op_kernel.cc:968] Failed precondition: Attempting to use uninitialized value global_step
[[Node: _send_global_step_0 = _Send[T=DT_INT64, client_terminated=true, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=8900174487477528080, tensor_name="global_step:0", _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
...
[[Node: _send_global_step_0 = _Send[T=DT_INT64, client_terminated=true, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=8900174487477528080, tensor_name="global_step:0", _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
W tensorflow/core/framework/op_kernel.cc:968] Failed precondition: Attempting to use uninitialized value global_step
[[Node: _send_global_step_0 = _Send[T=DT_INT64, client_terminated=true, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=8900174487477528080, tensor_name="global_step:0", _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
Traceback (most recent call last):
File "/Users/nikogamulin/workspace/tensorflow/bazel-bin/tensorflow/python/tools/freeze_graph.runfiles/org_tensorflow/tensorflow/python/tools/freeze_graph.py", line 135, in <module>
tf.app.run()
File "/Users/nikogamulin/workspace/tensorflow/bazel-bin/tensorflow/python/tools/freeze_graph.runfiles/org_tensorflow/tensorflow/python/platform/app.py", line 32, in run
sys.exit(main(sys.argv[:1] + flags_passthrough))
File "/Users/nikogamulin/workspace/tensorflow/bazel-bin/tensorflow/python/tools/freeze_graph.runfiles/org_tensorflow/tensorflow/python/tools/freeze_graph.py", line 132, in main
FLAGS.output_graph, FLAGS.clear_devices, FLAGS.initializer_nodes)
File "/Users/nikogamulin/workspace/tensorflow/bazel-bin/tensorflow/python/tools/freeze_graph.runfiles/org_tensorflow/tensorflow/python/tools/freeze_graph.py", line 121, in freeze_graph
sess, input_graph_def, output_node_names.split(","))
File "/Users/nikogamulin/workspace/tensorflow/bazel-bin/tensorflow/python/tools/freeze_graph.runfiles/org_tensorflow/tensorflow/python/framework/graph_util.py", line 226, in convert_variables_to_constants
returned_variables = sess.run(variable_names)
File "/Users/nikogamulin/workspace/tensorflow/bazel-bin/tensorflow/python/tools/freeze_graph.runfiles/org_tensorflow/tensorflow/python/client/session.py", line 717, in run
run_metadata_ptr)
File "/Users/nikogamulin/workspace/tensorflow/bazel-bin/tensorflow/python/tools/freeze_graph.runfiles/org_tensorflow/tensorflow/python/client/session.py", line 915, in _run
feed_dict_string, options, run_metadata)
File "/Users/nikogamulin/workspace/tensorflow/bazel-bin/tensorflow/python/tools/freeze_graph.runfiles/org_tensorflow/tensorflow/python/client/session.py", line 965, in _do_run
target_list, options, run_metadata)
File "/Users/nikogamulin/workspace/tensorflow/bazel-bin/tensorflow/python/tools/freeze_graph.runfiles/org_tensorflow/tensorflow/python/client/session.py", line 985, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.FailedPreconditionError: Attempting to use uninitialized value global_step
[[Node: _send_global_step_0 = _Send[T=DT_INT64, client_terminated=true, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=8900174487477528080, tensor_name="global_step:0", _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
Similarly, if I retrain the model running the following command:
python train_image_classifier.py \
--train_dir=${TRAIN_DIR} \
--dataset_dir=${DATASET_DIR} \
--dataset_name=flowers \
--dataset_split_name=train \
--model_name=inception_v1 \
--checkpoint_path=${CHECKPOINT_PATH} \
--checkpoint_exclude_scopes=InceptionV1/Logits,InceptionV1/AuxLogits/Logits \
--trainable_scopes=InceptionV1/Logits,InceptionV1/AuxLogits/Logits
and try to freeze the graph, get the errors related to
global_step
Does anyone know why the above errors occur and how to solve them? If anyone managed to freeze inception v1 (tf-slim) graph, I would be thankful for any suggestions that might solve the issue.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

"Create kernel failed" in Tensorflow when mapping nodes to different devices - tensorflow

Related

TensorFlow tfrecord edit (read and then write to file) OutOfRange error

I run the code in distributed mode and my code run good in asynchronous mode; but the code run unsuccessful in synchronous mode

assertion failed error when using tensorflow object detection API to fine tune the mask_rcnn_inception_resnet_v2_atrous_coco model

InvalidArgumentError: ConcatOp : Dimensions of inputs should match: shape[0] = [1,246,381,3] vs. shape[1] = [1,252,367,3]

Unable to freeze inception v1 (tf-slim) graph

Categories

Resources