I am trying to run the retrain.py script (available here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/image_retraining/retrain.py). I have noticed that the part starting at line 747 is executed on the CPU, although the GPU should be the default. So I wrapped the call in the following block to force it onto the GPU:
`with tf.device("/gpu:0"):
    (train_step, cross_entropy, bottleneck_input, ground_truth_input, final_tensor) = add_final_training_ops(
        len(image_lists.keys()),
        FLAGS.final_tensor_name,
        bottleneck_tensor)`
It causes the following error:
tensorflow.python.framework.errors.InvalidArgumentError: Cannot assign a device to node 'gradients/Mean_grad/Prod': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available
[[Node: gradients/Mean_grad/Prod = Prod[T=DT_INT32, keep_dims=false, _device="/device:GPU:0"](gradients/Mean_grad/Shape_2, gradients/Mean_grad/range_1)]]
Caused by op u'gradients/Mean_grad/Prod', defined at:
File "retrain_tensorboard_pickle_mean.py", line 921, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv))
File "retrain_tensorboard_pickle_mean.py", line 839, in main
(train_step, cross_entropy, bottleneck_input, ground_truth_input, label_ground_truth_input, final_tensor) = add_final_training_ops(len(image_lists.keys()), FLAGS.final_tensor_name, bottleneck_tensor)
File "retrain_tensorboard_pickle_mean.py", line 686, in add_final_training_ops
cross_entropy_mean)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 190, in minimize
colocate_gradients_with_ops=colocate_gradients_with_ops)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 241, in compute_gradients
colocate_gradients_with_ops=colocate_gradients_with_ops)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients.py", line 481, in gradients
in_grads = _AsList(grad_fn(op, *out_grads))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_grad.py", line 91, in _MeanGrad
factor = (math_ops.reduce_prod(input_shape) //
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 810, in reduce_prod
keep_dims, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 1115, in _prod
keep_dims=keep_dims, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/op_def_library.py", line 655, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2146, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1154, in __init__
self._traceback = _extract_stack()
...which was originally created as op u'Mean', defined at:
File "retrain_tensorboard_pickle_mean.py", line 921, in <module>
tf.app.run()
[elided 1 identical lines from previous traceback]
File "retrain_tensorboard_pickle_mean.py", line 839, in main
(train_step, cross_entropy, bottleneck_input, ground_truth_input, label_ground_truth_input, final_tensor) = add_final_training_ops(len(image_lists.keys()), FLAGS.final_tensor_name, bottleneck_tensor)
File "retrain_tensorboard_pickle_mean.py", line 681, in add_final_training_ops
cross_entropy_mean = tf.reduce_mean(cross_entropy)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 783, in reduce_mean
keep_dims, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 973, in _mean
keep_dims=keep_dims, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/op_def_library.py", line 655, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2146, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1154, in __init__
self._traceback = _extract_stack()
I have found here that the problem might be that mean is not implemented on the GPU; on the other hand, there is a commit on GitHub that fixes computing the mean on the GPU.
The previous part, e.g. generating bottlenecks (line 744), runs perfectly on the GPU without even forcing it.
I would be grateful for any help!
Justyna
This has now been fixed in b874e2c. Nice catch!
I am trying out a Faster-RCNN tutorial on colab:
https://colab.research.google.com/drive/1U3fkRu6-hwjk7wWIpg-iylL2u5T9t7rr#scrollTo=uQCnYPVDrsgx
But in the training step I get the error shown below.
The code is:
!python3 /content/models/research/object_detection/model_main.py \
    --pipeline_config_path={pipeline_fname} \
    --model_dir={model_dir} \
    --alsologtostderr \
    --num_train_steps={num_steps} \
    --num_eval_steps={num_eval_steps}
The output:
Using TensorFlow backend.
Cause: module 'gast' has no attribute 'Index'
Traceback (most recent call last):
File "/content/models/research/object_detection/model_main.py", line 109, in <module>
tf.app.run()
File "/tensorflow-1.15.2/python3.7/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "/content/models/research/object_detection/model_main.py", line 105, in main
tf_estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
File "/tensorflow-1.15.2/python3.7/tensorflow_estimator/python/estimator/training.py", line 473, in train_and_evaluate
return executor.run()
File "/tensorflow-1.15.2/python3.7/tensorflow_estimator/python/estimator/training.py", line 613, in run
return self.run_local()
File "/tensorflow-1.15.2/python3.7/tensorflow_estimator/python/estimator/training.py", line 714, in run_local
saving_listeners=saving_listeners)
File "/tensorflow-1.15.2/python3.7/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/tensorflow-1.15.2/python3.7/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/tensorflow-1.15.2/python3.7/tensorflow_estimator/python/estimator/estimator.py", line 1188, in _train_model_default
input_fn, ModeKeys.TRAIN))
File "/tensorflow-1.15.2/python3.7/tensorflow_estimator/python/estimator/estimator.py", line 1025, in _get_features_and_labels_from_input_fn
self._call_input_fn(input_fn, mode))
File "/tensorflow-1.15.2/python3.7/tensorflow_estimator/python/estimator/estimator.py", line 1116, in _call_input_fn
return input_fn(**kwargs)
File "/content/models/research/object_detection/inputs.py", line 765, in _train_input_fn
params=params)
File "/content/models/research/object_detection/inputs.py", line 908, in train_input
reduce_to_frame_fn=reduce_to_frame_fn)
File "/content/models/research/object_detection/builders/dataset_builder.py", line 252, in build
input_reader_config)
File "/content/models/research/object_detection/builders/dataset_builder.py", line 237, in dataset_map_fn
fn_to_map, num_parallel_calls=num_parallel_calls)
File "/tensorflow-1.15.2/python3.7/tensorflow_core/python/util/deprecation.py", line 324, in new_func
return func(*args, **kwargs)
File "/tensorflow-1.15.2/python3.7/tensorflow_core/python/data/ops/dataset_ops.py", line 1950, in map_with_legacy_function
use_legacy_function=True))
File "/tensorflow-1.15.2/python3.7/tensorflow_core/python/data/ops/dataset_ops.py", line 3472, in __init__
use_legacy_function=use_legacy_function)
File "/tensorflow-1.15.2/python3.7/tensorflow_core/python/data/ops/dataset_ops.py", line 2689, in __init__
self._function.add_to_graph(ops.get_default_graph())
File "/tensorflow-1.15.2/python3.7/tensorflow_core/python/framework/function.py", line 545, in add_to_graph
self._create_definition_if_needed()
File "/tensorflow-1.15.2/python3.7/tensorflow_core/python/framework/function.py", line 377, in _create_definition_if_needed
self._create_definition_if_needed_impl()
File "/tensorflow-1.15.2/python3.7/tensorflow_core/python/framework/function.py", line 408, in _create_definition_if_needed_impl
capture_resource_var_by_value=self._capture_resource_var_by_value)
File "/tensorflow-1.15.2/python3.7/tensorflow_core/python/framework/function.py", line 944, in func_graph_from_py_func
outputs = func(*func_graph.inputs)
File "/tensorflow-1.15.2/python3.7/tensorflow_core/python/data/ops/dataset_ops.py", line 2681, in wrapper_fn
ret = _wrapper_helper(*args)
File "/tensorflow-1.15.2/python3.7/tensorflow_core/python/data/ops/dataset_ops.py", line 2652, in _wrapper_helper
ret = autograph.tf_convert(func, ag_ctx)(*nested_args)
File "/tensorflow-1.15.2/python3.7/tensorflow_core/python/autograph/impl/api.py", line 237, in wrapper
raise e.ag_error_metadata.to_exception(e)
NotImplementedError: in converted code:
/content/models/research/object_detection/data_decoders/tf_example_decoder.py:580 decode
default_groundtruth_weights)
/tensorflow-1.15.2/python3.7/tensorflow_core/python/util/deprecation.py:507 new_func
return func(*args, **kwargs)
/tensorflow-1.15.2/python3.7/tensorflow_core/python/ops/control_flow_ops.py:1235 cond
orig_res_f, res_f = context_f.BuildCondBranch(false_fn)
/tensorflow-1.15.2/python3.7/tensorflow_core/python/ops/control_flow_ops.py:1061 BuildCondBranch
original_result = fn()
/content/models/research/object_detection/data_decoders/tf_example_decoder.py:573 default_groundtruth_weights
dtype=tf.float32)
/tensorflow-1.15.2/python3.7/tensorflow_core/python/ops/array_ops.py:2560 ones
output = _constant_if_small(one, shape, dtype, name)
/tensorflow-1.15.2/python3.7/tensorflow_core/python/ops/array_ops.py:2295 _constant_if_small
if np.prod(shape) < 1000:
<array_function internals>:6 prod
/usr/local/lib/python3.7/dist-packages/numpy/core/fromnumeric.py:3052 prod
keepdims=keepdims, initial=initial, where=where)
/usr/local/lib/python3.7/dist-packages/numpy/core/fromnumeric.py:86 _wrapreduction
return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
/tensorflow-1.15.2/python3.7/tensorflow_core/python/framework/ops.py:736 array
" array.".format(self.name))
NotImplementedError: Cannot convert a symbolic Tensor (cond_2/strided_slice:0) to a numpy array.
I have tried
pip install numpy==1.19.5
but that does not work either; I get another error:
Traceback (most recent call last):
File "/content/models/research/object_detection/model_main.py", line 26, in <module>
from object_detection import model_lib
File "/content/models/research/object_detection/model_lib.py", line 30, in <module>
from object_detection import eval_util
File "/content/models/research/object_detection/eval_util.py", line 35, in <module>
from object_detection.metrics import coco_evaluation
File "/content/models/research/object_detection/metrics/coco_evaluation.py", line 25, in <module>
from object_detection.metrics import coco_tools
File "/content/models/research/object_detection/metrics/coco_tools.py", line 51, in <module>
from pycocotools import coco
File "/usr/local/lib/python3.7/dist-packages/pycocotools/coco.py", line 52, in <module>
from . import mask as maskUtils
File "/usr/local/lib/python3.7/dist-packages/pycocotools/mask.py", line 3, in <module>
import pycocotools._mask as _mask
File "pycocotools/_mask.pyx", line 1, in init pycocotools._mask
ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject
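Both failures look like version mismatches between TensorFlow 1.15 and the packages installed next to it, so a first step could be to print what is actually installed and compare it with the versions you intend to use. The sketch below uses only the standard library (importlib.metadata, Python 3.8+). The pins are assumptions, not verified requirements: numpy 1.19.5 is the version tried in the question, and gast 0.2.2 is the version I believe TF 1.15 declares; check both against your TensorFlow release. The helper names are made up for this example.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version string or None} for the given distribution names."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


def find_mismatches(installed, pins):
    """Return {package: (installed, wanted)} for every pin that is not met."""
    return {
        name: (installed.get(name), wanted)
        for name, wanted in pins.items()
        if installed.get(name) != wanted
    }


# Assumed pins for a TF 1.15 environment; adjust them to your setup.
ASSUMED_PINS = {"numpy": "1.19.5", "gast": "0.2.2"}

current = installed_versions(ASSUMED_PINS)
for name, (have, want) in find_mismatches(current, ASSUMED_PINS).items():
    print(f"{name}: installed {have}, expected {want}")
```

The second error ("numpy.ndarray size changed") usually indicates that pycocotools was compiled against a different NumPy than the one now installed, so after changing the NumPy version it is probably necessary to reinstall pycocotools and restart the runtime.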
I'm trying to debug my trained Faster R-CNN model using the TensorFlow Object Detection API, and I want to visualize the RPN's proposal regions on an image. Can anyone tell me how to do it?
I found a post here, but it hasn't been answered. I tried to export the model using exporter_main_v2.py with only the RPN head, as suggested here, and this is the message I got when I deleted the second_stage:
Traceback (most recent call last):
File "exporter_main_v2.py", line 165, in <module>
app.run(main)
File "E:\Anaconda\envs\TFOD\lib\site-packages\absl\app.py", line 312, in run
_run_main(main, args)
File "E:\Anaconda\envs\TFOD\lib\site-packages\absl\app.py", line 258, in _run_main
sys.exit(main(argv))
File "exporter_main_v2.py", line 158, in main
exporter_lib_v2.export_inference_graph(
File "E:\Anaconda\envs\TFOD\lib\site-packages\object_detection\exporter_lib_v2.py", line 245, in export_inference_graph
detection_model = INPUT_BUILDER_UTIL_MAP['model_build'](
File "E:\Anaconda\envs\TFOD\lib\site-packages\object_detection\builders\model_builder.py", line 1226, in build
return build_func(getattr(model_config, meta_architecture), is_training,
File "E:\Anaconda\envs\TFOD\lib\site-packages\object_detection\builders\model_builder.py", line 665, in _build_faster_rcnn_model
second_stage_box_predictor = box_predictor_builder.build_keras(
File "E:\Anaconda\envs\TFOD\lib\site-packages\object_detection\builders\box_predictor_builder.py", line 991, in build_keras
raise ValueError(
ValueError: Unknown box predictor for Keras: None
I then tried exporting the model again without deleting the second_stage, and this is the message I got:
INFO:tensorflow:depth of additional conv before box predictor: 0
I0802 20:55:13.930429 1996 convolutional_keras_box_predictor.py:153] depth of additional conv before box predictor: 0
Traceback (most recent call last):
File "exporter_main_v2.py", line 165, in <module>
app.run(main)
File "E:\Anaconda\envs\TFOD\lib\site-packages\absl\app.py", line 312, in run
_run_main(main, args)
File "E:\Anaconda\envs\TFOD\lib\site-packages\absl\app.py", line 258, in _run_main
sys.exit(main(argv))
File "exporter_main_v2.py", line 158, in main
exporter_lib_v2.export_inference_graph(
File "E:\Anaconda\envs\TFOD\lib\site-packages\object_detection\exporter_lib_v2.py", line 271, in export_inference_graph
concrete_function = detection_module.__call__.get_concrete_function()
File "E:\Anaconda\envs\TFOD\lib\site-packages\tensorflow\python\eager\def_function.py", line 1299, in get_concrete_function
concrete = self._get_concrete_function_garbage_collected(*args, **kwargs)
File "E:\Anaconda\envs\TFOD\lib\site-packages\tensorflow\python\eager\def_function.py", line 1205, in _get_concrete_function_garbage_collected
self._initialize(args, kwargs, add_initializers_to=initializers)
File "E:\Anaconda\envs\TFOD\lib\site-packages\tensorflow\python\eager\def_function.py", line 725, in _initialize
self._stateful_fn._get_concrete_function_internal_garbage_collected( # pylint: disable=protected-access
File "E:\Anaconda\envs\TFOD\lib\site-packages\tensorflow\python\eager\function.py", line 2969, in _get_concrete_function_internal_garbage_collected
graph_function, _ = self._maybe_define_function(args, kwargs)
File "E:\Anaconda\envs\TFOD\lib\site-packages\tensorflow\python\eager\function.py", line 3361, in _maybe_define_function
graph_function = self._create_graph_function(args, kwargs)
File "E:\Anaconda\envs\TFOD\lib\site-packages\tensorflow\python\eager\function.py", line 3196, in _create_graph_function
func_graph_module.func_graph_from_py_func(
File "E:\Anaconda\envs\TFOD\lib\site-packages\tensorflow\python\framework\func_graph.py", line 990, in func_graph_from_py_func
func_outputs = python_func(*func_args, **func_kwargs)
File "E:\Anaconda\envs\TFOD\lib\site-packages\tensorflow\python\eager\def_function.py", line 634, in wrapped_fn
out = weak_wrapped_fn().__wrapped__(*args, **kwds)
File "E:\Anaconda\envs\TFOD\lib\site-packages\tensorflow\python\framework\func_graph.py", line 977, in wrapper
raise e.ag_error_metadata.to_exception(e)
tensorflow.python.autograph.pyct.error_utils.KeyError: in user code:
E:\Anaconda\envs\TFOD\lib\site-packages\object_detection\exporter_lib_v2.py:163 call_func *
return self._run_inference_on_images(images, true_shapes, **kwargs)
E:\Anaconda\envs\TFOD\lib\site-packages\object_detection\exporter_lib_v2.py:129 _run_inference_on_images *
detections[classes_field] = (
KeyError: 'detection_classes'
Found the solution!
In the config file, add number_of_stages: 1.
Instead of using exporter_main_v2.py, I wrote code that builds the model from the checkpoint file:
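import os
import tensorflow as tf
from object_detection.builders import model_builder
from object_detection.utils import config_util

# Load pipeline config and build a detection model
configs = config_util.get_configs_from_pipeline_file(path_to_config)
model_config = configs['model']
detection_model = model_builder.build(model_config=model_config, is_training=False)
# Restore checkpoint (expect_partial() silences warnings about unused checkpoint values)
ckpt = tf.compat.v2.train.Checkpoint(model=detection_model)
ckpt.restore(os.path.join(path_to_ckpt, 'ckpt-0')).expect_partial()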
Then I feed the image I want to inspect to the model and use object_detection.utils.visualization_utils.visualize_boxes_and_labels_on_image_array to draw the boxes.
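If you would rather draw the proposals yourself, the post-processing step can be sketched in plain Python. This is only an illustration under assumptions: I assume the model output contains 'detection_boxes' as normalized [ymin, xmin, ymax, xmax] values and 'detection_scores' as a parallel list, already converted to Python lists or NumPy arrays. proposals_to_pixel_boxes is a hypothetical helper, not part of the Object Detection API.

```python
def proposals_to_pixel_boxes(boxes, scores, image_height, image_width,
                             min_score=0.5, max_boxes=20):
    """Keep the best-scoring proposals and convert them to pixel coordinates.

    boxes:  iterable of normalized [ymin, xmin, ymax, xmax] (assumed layout).
    scores: iterable of floats, one per box.
    Returns a list of (left, top, right, bottom) integer tuples, best first.
    """
    # Sort by score only, highest first.
    ranked = sorted(zip(scores, boxes), key=lambda pair: pair[0], reverse=True)
    result = []
    for score, (ymin, xmin, ymax, xmax) in ranked:
        if score < min_score or len(result) >= max_boxes:
            break  # everything after this point scores lower or exceeds the cap
        result.append((
            int(round(xmin * image_width)),
            int(round(ymin * image_height)),
            int(round(xmax * image_width)),
            int(round(ymax * image_height)),
        ))
    return result
```

The returned tuples can be passed to any drawing routine that takes pixel rectangles, for example PIL's ImageDraw.rectangle.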
I am trying to train my model using Google Cloud's TPUs. The model works fine on CPU and GPU, and I can run the TPU tutorials without any problems (so it is not a problem of connecting to the TPUs). However, when I run my program on a Cloud TPU I get an error. The most important line is probably the following:
NotImplementedError: Non-resource Variables are not supported inside TPU computations (operator name: training_op/update_2nd_caps/primary_to_first_fc/W/ApplyAdam/RefEnter)
And here is the full error in case there is something important there:
Traceback (most recent call last):
File "TPU_playground.py", line 85, in <module>
capser.train(input_fn=train_input_fn_tpu, steps=n_steps)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 366, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1119, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1132, in _train_model_default
features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1992, in _call_model_fn
features, labels, mode, config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1107, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2223, in _model_fn
_train_on_tpu_system(ctx, model_fn_wrapper, dequeue_fn))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2537, in _train_on_tpu_system
device_assignment=ctx.device_assignment)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu.py", line 733, in shard
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu.py", line 394, in replicate
device_assignment, name)[1]
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu.py", line 546, in split_compile_and_replicate
outputs = computation(*computation_inputs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2530, in multi_tpu_train_steps_on_single_shard
single_tpu_train_step, [_INITIAL_LOSS])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/training_loop.py", line 207, in repeat
cond, body_wrapper, inputs=inputs, infeed_queue=infeed_queue, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/training_loop.py", line 169, in while_loop
name="")
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 3209, in while_loop
result = loop_context.BuildLoop(cond, body, loop_vars, shape_invariants)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2941, in BuildLoop
pred, body, original_loop_vars, loop_vars, shape_invariants)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2878, in _BuildLoop
body_result = body(*packed_vars_for_body)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/training_loop.py", line 120, in body_wrapper
outputs = body(*(inputs + dequeue_ops))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/training_loop.py", line 203, in body_wrapper
return [i + 1] + _convert_to_list(body(*args))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1166, in train_step
self._call_model_fn(features, labels))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1337, in _call_model_fn
estimator_spec = self._model_fn(features=features, **kwargs)
File "/home/adrien_doerig/capser/capser_7_model_fn.py", line 100, in model_fn_tpu
**output_decoder_deconv_params)
File "/home/adrien_doerig/capser/capser_model.py", line 341, in capser_model
loss_training_op = optimizer.minimize(loss=loss, global_step=tf.train.get_global_step(), name="training_op")
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 409, in minimize
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_optimizer.py", line 114, in apply_gradients
return self._opt.apply_gradients(summed_grads_and_vars, global_step, name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 602, in apply_gradients
update_ops.append(processor.update_op(self, grad))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 113, in update_op
update_op = optimizer._apply_dense(g, self._v) # pylint: disable=protected-access
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/adam.py", line 148, in _apply_dense
grad, use_locking=self._use_locking).op
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/gen_training_ops.py", line 293, in apply_adam
use_nesterov=use_nesterov, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3414, in create_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1782, in __init__
self._control_flow_post_processing()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1793, in _control_flow_post_processing
self._control_flow_context.AddOp(self)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2430, in AddOp
self._AddOpInternal(op)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2451, in _AddOpInternal
real_x = self.AddValue(x)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2398, in AddValue
self._outer_context.AddInnerOp(enter.op)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu.py", line 310, in AddInnerOp
self._AddOpInternal(op)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu.py", line 287, in _AddOpInternal
"(operator name: %s)" % op.name)
NotImplementedError: Non-resource Variables are not supported inside TPU computations (operator name: training_op/update_2nd_caps/primary_to_first_fc/W/ApplyAdam/RefEnter)
It seems that the forward pass of the graph is built fine, but the backprop using AdamOptimizer is not supported by the TPUs in this case. I tried using more standard optimizers (GradientDescentOptimizer and MomentumOptimizer) but it doesn't help. All the tensors in the feedforward pass are in formats compatible with TPUs (i.e. tf.float32).
Does anyone have suggestions as to what I should try?
Thank you!
I have found a way to use the TPUs without using the ctpu up command, which solves the problem. I simply do everything exactly as I would do it to run my code on cloud GPUs:
-- see documentation here: https://cloud.google.com/ml-engine/docs/tensorflow/getting-started-training-prediction
-- a simple explanatory video here: https://www.youtube.com/watch?v=J_d4bEKUG2Q
BUT, the ONLY DIFFERENCE is that I use --scale-tier 'BASIC_TPU' instead of --scale-tier 'STANDARD_1' when I run my job. So the command to run the job is
gcloud ml-engine jobs submit training $JOB_NAME --module-name capser.capser_7_multi_gpu --package-path ./capser --job-dir=gs://capser-data/$JOB_NAME --scale-tier 'BASIC_TPU' --stream-logs --runtime-version 1.9 --region us-central1
(I define the variable $JOB_NAME beforehand: export JOB_NAME=<input your job name>)
Also, make sure you choose a region which has TPUs! us-central1 works for example.
So maybe it is a small bug when using ctpu up, but it seems not to be a problem when using the above method. I hope that helps!
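If you submit jobs from a script, the same command can be assembled in Python so that the scale tier is the only thing that changes between a GPU and a TPU run. This is just a sketch: build_submit_command is a hypothetical helper, and every flag and value is copied from the command above rather than checked against the current gcloud CLI.

```python
import shlex


def build_submit_command(job_name, scale_tier="BASIC_TPU", region="us-central1"):
    """Return the gcloud argument list for submitting the training job."""
    return [
        "gcloud", "ml-engine", "jobs", "submit", "training", job_name,
        "--module-name", "capser.capser_7_multi_gpu",
        "--package-path", "./capser",
        f"--job-dir=gs://capser-data/{job_name}",
        "--scale-tier", scale_tier,   # 'STANDARD_1' in the original GPU setup
        "--stream-logs",
        "--runtime-version", "1.9",
        "--region", region,           # must be a region that offers TPUs
    ]


# Print a shell-ready command line; pass the list to subprocess.run to execute it.
print(shlex.join(build_submit_command("my_job")))
```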
I am training Mask R-CNN. With tf-1.2 it trains fine, but with tf-1.5 training fails.
The error is as follows:
Caused by op u'pyramid_1/AssignGTBoxes/Where_6', defined at:
File "/home/zhouzd2/letrain/applications/letrain.py", line 349, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 124, in run
_sys.exit(main(argv))
File "/home/zhouzd2/letrain/applications/letrain.py", line 346, in main
LeTrain().model_train(user_mode)
File "/home/zhouzd2/letrain/platform/base_train.py", line 1228, in model_train
cluster=self.cluster_spec)
File "/home/zhouzd2/letrain/platform/deployment/model_deploy.py", line 226, in create_clones
outputs, feed_ops,verify_model_loss = model_fn(*args, **kwargs)
File "/home/zhouzd2/letrain/platform/base_train.py", line 1195, in clone_fn
model_loss, end_points, feed_ops = network_fn(data_direct, data_batch, int_network_fn)
File "/home/zhouzd2/letrain/applications/letrain.py", line 214, in get_loss
FLAGS.batch_size)
File "/home/zhouzd2/letrain/applications/fmrcnn/get_fmrcnn_loss.py", line 23, in model_fn
loss_weights=[0.2, 0.2, 1.0, 0.2, 1.0])
File "/home/zhouzd2/letrain/applications/fmrcnn/libs/nets/pyramid_network.py", line 580, in build
is_training=is_training, gt_boxes=gt_boxes)
File "/home/zhouzd2/letrain/applications/fmrcnn/libs/nets/pyramid_network.py", line 263, in build_heads
assign_boxes(rois, [rois, batch_inds], [2, 3, 4, 5])
File "/home/zhouzd2/letrain/applications/fmrcnn/libs/layers/wrapper.py", line 173, in assign_boxes
inds = tf.where(tf.equal(assigned_layers, l))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/array_ops.py", line 2538, in where
return gen_array_ops.where(condition=condition, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 6087, in where
"Where", input=condition, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3160, in create_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1625, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InternalError (see above for traceback): WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true / nonzero indices. temp_storage_bytes: 1, status: no kernel image is available for execution on the device
[[Node: pyramid_1/AssignGTBoxes/Where_6 = Where[T=DT_BOOL, _device="/job:worker/replica:0/task:0/device:GPU:0"](pyramid_1/AssignGTBoxes/Equal_6_S9493)]]
[[Node: pyramid_1/AssignGTBoxes/Reshape_8_G1028 = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:0/device:CPU:0", send_device="/job:worker/replica:0/task:0/device:GPU:0", send_device_incarnation=5407481677180697062, tensor_name="edge_1349_pyramid_1/AssignGTBoxes/Reshape_8", tensor_type=DT_INT64, _device="/job:worker/replica:0/task:0/device:CPU:0"]()]]
There is no problem when building the computation graph; the error is reported in sess.run().
Does anyone know how to solve this problem? Or does anyone know which function can replace tf.where?
Thank you!
If you are using Visual Studio:
Right click on the project > Properties > CUDA C/C++ > Device
and add the following to the Code Generation field:
compute_30,sm_30;compute_35,sm_35;compute_37,sm_37;compute_50,sm_50;compute_52,sm_52;compute_60,sm_60;compute_61,sm_61;compute_70,sm_70;compute_75,sm_75;
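The "no kernel image is available for execution on the device" message generally means the binary contains no code built for your GPU's compute capability, which is why the field above lists one compute_XY,sm_XY pair per architecture. If you only want to target specific GPUs, the string can be generated from their compute capabilities. A small sketch, where code_generation_field is a hypothetical helper and the capabilities are passed in as strings such as "6.1":

```python
def code_generation_field(compute_capabilities):
    """Build a Visual Studio 'Code Generation' value from compute capabilities.

    compute_capabilities: iterable of strings like "5.2" or "7.5".
    """
    entries = []
    for capability in compute_capabilities:
        major, minor = capability.split(".")
        arch = f"{int(major)}{int(minor)}"  # "7.5" -> "75"
        entries.append(f"compute_{arch},sm_{arch};")
    return "".join(entries)


# Example: target only compute capability 5.0 and 6.1 devices.
print(code_generation_field(["5.0", "6.1"]))
```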
I have a saved checkpoint generated by graph code in a regular non-distributed setup with the constraint `with tf.device('/cpu:0'):` (to force the model parameters to reside on the CPU instead of the GPU).
I have now converted the same code/graph to a distributed setting following the guidelines in TF-Inception.
When I try to restore the checkpoint in the distributed setup, I get device mismatch errors. Is there a way to override the device requirements saved in the checkpoint file?
My new distributed code has the Saver and scopes defined as:
if FLAGS.job_name == 'worker':
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % FLAGS.task_id,
            cluster=cluster_spec)):
        # ...same network-graph code... #
    restorer = tf.train.Saver()
    with tf.Session() as sess:
        restorer.restore(sess, 'ResNet-L50.ckpt')
My cluster has one ps and one worker, and both are on localhost. Error line:
tensorflow.python.framework.errors.InvalidArgumentError: Cannot assign a device to node 'save/restore_slice_268/shape_and_slice': Could not satisfy explicit device specification '/job:ps/task:0/device:CPU:0' because no devices matching that specification are registered in this process; available devices: /job:localhost/replica:0/task:0/cpu:0, /job:localhost/replica:0/task:0/gpu:0
[[Node: save/restore_slice_268/shape_and_slice = Const[dtype=DT_STRING, value=Tensor<type: string shape: [] values: >, _device="/job:ps/task:0/device:CPU:0"]()]]
Full error trace:
I tensorflow/core/common_runtime/gpu/gpu_device.cc:756] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Quadro K2200, pci bus id: 0000:01:00.0)
Traceback (most recent call last):
File "dlaunch.py", line 85, in <module>
tf.app.run() # (tf.app.flags parsed here)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv))
File "dlaunch.py", line 81, in main
dtrainer.train(server.target, cluster_spec)
File "/home/muneeb/parkingtf/dtrainer.py", line 88, in train
restorer.restore(sess, 'ResNet-L50.ckpt')
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1103, in restore
{self.saver_def.filename_tensor_name: save_path})
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 328, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 563, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 636, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 658, in _do_call
e.code)
tensorflow.python.framework.errors.InvalidArgumentError: Cannot assign a device to node 'save/restore_slice_268/shape_and_slice': Could not satisfy explicit device specification '/job:ps/task:0/device:CPU:0' because no devices matching that specification are registered in this process; available devices: /job:localhost/replica:0/task:0/cpu:0, /job:localhost/replica:0/task:0/gpu:0
[[Node: save/restore_slice_268/shape_and_slice = Const[dtype=DT_STRING, value=Tensor<type: string shape: [] values: >, _device="/job:ps/task:0/device:CPU:0"]()]]
Caused by op u'save/restore_slice_268/shape_and_slice', defined at:
File "dlaunch.py", line 85, in <module>
tf.app.run() # (tf.app.flags parsed here)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv))
File "dlaunch.py", line 81, in main
dtrainer.train(server.target, cluster_spec)
File "/home/muneeb/parkingtf/dtrainer.py", line 86, in train
restorer = tf.train.Saver()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 845, in __init__
restore_sequentially=restore_sequentially)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 515, in build
filename_tensor, vars_to_save, restore_sequentially, reshape)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 271, in _AddRestoreOps
values = self.restore_op(filename_tensor, vs, preferred_shard)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 186, in restore_op
preferred_shard=preferred_shard)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/io_ops.py", line 201, in _restore_slice
preferred_shard, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 271, in _restore_slice
preferred_shard=preferred_shard, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/op_def_library.py", line 444, in apply_op
as_ref=input_arg.is_ref)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 566, in convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/constant_op.py", line 179, in _constant_tensor_conversion_function
return constant(v, dtype=dtype, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/constant_op.py", line 166, in constant
attrs={"value": tensor_value, "dtype": dtype_value}, name=name).outputs[0]
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2162, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1161, in __init__
self._traceback = _extract_stack()
The following line:
with tf.Session() as sess:
...is responsible for the error. Passing no arguments to tf.Session() creates an in-process session that can only use devices on the local machine. To work in the distributed mode, you should have something like:
# Assuming you created `server = tf.train.Server(...)` earlier.
with tf.Session(server.target) as sess:
...or, if you are connecting to a different process:
# Assuming your server is in a different process.
with tf.Session("grpc://..."):
Note that the devices are not stored in the checkpoint file; they are added by tf.train.replica_device_setter(). Device configuration is a bit tricky right now, and it's something that we're working to simplify.
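To see why the in-process session cannot satisfy '/job:ps/task:0/device:CPU:0', it can help to compare the requested device with the devices listed in the error message. The following is a deliberately simplified model in plain Python, not TensorFlow's actual placement logic; parse_device and satisfies are hypothetical helpers written for this illustration, and they only handle CPU and GPU device names.

```python
def parse_device(spec):
    """Parse e.g. '/job:ps/task:0/device:CPU:0' or '/job:localhost/replica:0/task:0/cpu:0'."""
    fields = {}
    for part in spec.strip("/").split("/"):
        pieces = part.split(":")
        if pieces[0] == "device":                 # '/device:CPU:0' form
            pieces = pieces[1:]
        if pieces[0].lower() in ("cpu", "gpu"):   # 'cpu:0' or 'CPU:0'
            fields["type"] = pieces[0].upper()
            fields["index"] = int(pieces[1])
        else:                                     # 'job:ps', 'replica:0', 'task:0'
            key, value = pieces
            fields[key] = value if key == "job" else int(value)
    return fields


def satisfies(requested, available):
    """True if every field set in `requested` has the same value in `available`."""
    want, have = parse_device(requested), parse_device(available)
    return all(have.get(key) == value for key, value in want.items())


# Devices reported by the in-process session in the error message.
local_devices = [
    "/job:localhost/replica:0/task:0/cpu:0",
    "/job:localhost/replica:0/task:0/gpu:0",
]
requested = "/job:ps/task:0/device:CPU:0"
# Prints False: no local device belongs to job 'ps'.
print(any(satisfies(requested, device) for device in local_devices))
```

A session created with tf.Session(server.target) connects to the cluster, so devices in the ps and worker jobs become visible and a specification like the one above can be matched.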