I'm training a small, simple neural net for a basic problem of regulating a motor's speed. I want to be able to save the model and exit the program, then load it later and resume training.
Here's the relevant code:
self.model = Sequential()
self.model.add(InputLayer(2))
self.model.add(Dense(6, activation='relu'))
self.model.add(Dense(9, activation='linear'))
self.model.compile(loss='mse', optimizer='adam', metrics=['mae'])
# ... Loop for training and Evaluation (Deep Q Learner) ...
learn(self.model)
self.model.save('motor_model', save_format='tf')
Now, after it's trained, I want to be able to load the model and continue training:
self.model = models.load_model('motor_model', compile=False)
# ... Loop for training and Evaluation (Deep Q Learner) ...
learn(self.model)
The first time I run the model it works fine. However, after saving and loading the model it does not. Upon loading the model I am able to call the predict function:
prediction = self.model.predict(currentInput)
However, it fails when I call the fit function:
self.model.fit(self.input, target_vec.reshape(-1, 9), epochs=1, verbose=0)
The error I get is:
2019-12-07 07:22:00.762174: W tensorflow/c/c_api.cc:326] Operation '{name:'sequential/dense/StatefulPartitionedCall' id:33 op device:{} def:{{{node sequential/dense/StatefulPartitionedCall}} = StatefulPartitionedCall[Tin=[DT_FLOAT, DT_RESOURCE, DT_RESOURCE], Tout=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _gradient_op_type="PartitionedCall-298", config="", config_proto="\n\007\n\003CPU\020\001\n\007\n\003GPU\020\0002\002J\0008\001", executor_type="", f=__forward_restored_function_body_509[]](input_1, dense/kernel, dense/bias)}}' was changed by setting attribute after it was run by a session. This mutation will have no effect, and will trigger an error in the future. Either don't modify nodes after running them or create a new session.
2019-12-07 07:22:03.320478: W tensorflow/python/util/util.cc:299] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/client/session.py", line 1363, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/client/session.py", line 1346, in _run_fn
    self._extend_graph()
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/client/session.py", line 1386, in _extend_graph
    tf_session.ExtendSession(self._session)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Node 'training/Adam/gradients/gradients/sequential/dense_1/StatefulPartitionedCall_grad/PartitionedCall': Connecting to invalid output 1 of source node sequential/dense_1/StatefulPartitionedCall which has 1 outputs.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "ct2.py", line 47, in <module>
    leftController.to_position(target, overrideAction)
  File "/opt/mowzr/motor_controller.py", line 94, in to_position
    self.model.fit(self.prevInput, target_vec.reshape(-1, 9), epochs=1, verbose=0)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/keras/engine/training.py", line 766, in fit
    use_multiprocessing=use_multiprocessing)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/keras/engine/training_arrays.py", line 680, in fit
    steps_name='steps_per_epoch')
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/keras/engine/training_arrays.py", line 275, in model_iteration
    model.reset_metrics()
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/keras/engine/training.py", line 953, in reset_metrics
    m.reset_states()
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/keras/metrics.py", line 209, in reset_states
    K.batch_set_value([(v, 0) for v in self.variables])
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/keras/backend.py", line 3343, in batch_set_value
    get_session().run(assign_ops, feed_dict=feed_dict)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/keras/backend.py", line 490, in get_session
    _initialize_variables(session)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/keras/backend.py", line 905, in _initialize_variables
    [variables_module.is_variable_initialized(v) for v in candidate_vars])
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/client/session.py", line 1179, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/client/session.py", line 1357, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/client/session.py", line 1382, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Node 'training/Adam/gradients/gradients/sequential/dense_1/StatefulPartitionedCall_grad/PartitionedCall': Connecting to invalid output 1 of source node sequential/dense_1/StatefulPartitionedCall which has 1 outputs.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/client/session.py", line 1363, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/client/session.py", line 1346, in _run_fn
    self._extend_graph()
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/client/session.py", line 1386, in _extend_graph
    tf_session.ExtendSession(self._session)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Node 'training/Adam/gradients/gradients/sequential/dense_1/StatefulPartitionedCall_grad/PartitionedCall': Connecting to invalid output 1 of source node sequential/dense_1/StatefulPartitionedCall which has 1 outputs.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "ct2.py", line 53, in <module>
    leftController.saveModel()
  File "/opt/mowzr/motor_controller.py", line 116, in saveModel
    self.model.save('motor_model', save_format='tf')
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/keras/engine/network.py", line 986, in save
    signatures, options)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/keras/saving/save.py", line 115, in save_model
    signatures, options)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/keras/saving/saved_model/save.py", line 74, in save
    save_lib.save(model, filepath, signatures, options)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/saved_model/save.py", line 924, in save
    object_saver.save(utils_impl.get_variables_path(export_dir))
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/training/tracking/util.py", line 1161, in save
    session = get_session()
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/training/tracking/util.py", line 71, in get_session
    session = keras_backend.get_session()
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/keras/backend.py", line 490, in get_session
    _initialize_variables(session)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/keras/backend.py", line 905, in _initialize_variables
    [variables_module.is_variable_initialized(v) for v in candidate_vars])
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/client/session.py", line 1179, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/client/session.py", line 1357, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/client/session.py", line 1382, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Node 'training/Adam/gradients/gradients/sequential/dense_1/StatefulPartitionedCall_grad/PartitionedCall': Connecting to invalid output 1 of source node sequential/dense_1/StatefulPartitionedCall which has 1 outputs.
I ran into the same error. I don't know exactly what produces it, but there is a way to work around it (not a pretty one): create a model with the same architecture and set its weights from the loaded model's weights:
self.model = self.create_model()
self.model.set_weights(load_model("sample.model").get_weights())
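For completeness, here is a minimal sketch of how that workaround could look with the architecture from this question. Note that only the weights are copied, so the Adam optimizer state is not resumed; the freshly compiled model simply continues training from the loaded weights:
from tensorflow.keras import Sequential, models
from tensorflow.keras.layers import InputLayer, Dense

def create_model():
    # Rebuild exactly the architecture used before saving.
    model = Sequential()
    model.add(InputLayer(input_shape=(2,)))
    model.add(Dense(6, activation='relu'))
    model.add(Dense(9, activation='linear'))
    model.compile(loss='mse', optimizer='adam', metrics=['mae'])
    return model

model = create_model()
# Copy the trained weights from the saved model into the new, compiled model,
# then resume training with model.fit(...) as before.
model.set_weights(models.load_model('motor_model', compile=False).get_weights())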
Related
I'm trying to debug my trained Faster R-CNN model using the Tensorflow Object Detection API, and I want to visualize the proposal regions of the RPN on an image. Can anyone tell me how to do it?
I found a post here, but it hasn't been answered. I tried to export the model using exporter_main_v2.py with only the RPN head, as described here, and this is the message I got when I deleted the second_stage.
Traceback (most recent call last):
File "exporter_main_v2.py", line 165, in <module>
app.run(main)
File "E:\Anaconda\envs\TFOD\lib\site-packages\absl\app.py", line 312, in run
_run_main(main, args)
File "E:\Anaconda\envs\TFOD\lib\site-packages\absl\app.py", line 258, in _run_main
sys.exit(main(argv))
File "exporter_main_v2.py", line 158, in main
exporter_lib_v2.export_inference_graph(
File "E:\Anaconda\envs\TFOD\lib\site-packages\object_detection\exporter_lib_v2.py", line 245, in export_inference_graph
detection_model = INPUT_BUILDER_UTIL_MAP['model_build'](
File "E:\Anaconda\envs\TFOD\lib\site-packages\object_detection\builders\model_builder.py", line 1226, in build
return build_func(getattr(model_config, meta_architecture), is_training,
File "E:\Anaconda\envs\TFOD\lib\site-packages\object_detection\builders\model_builder.py", line 665, in _build_faster_rcnn_model
second_stage_box_predictor = box_predictor_builder.build_keras(
File "E:\Anaconda\envs\TFOD\lib\site-packages\object_detection\builders\box_predictor_builder.py", line 991, in build_keras
raise ValueError(
ValueError: Unknown box predictor for Keras: None
I tried again to export the model without deleting the second_stage, and this is the message I got:
INFO:tensorflow:depth of additional conv before box predictor: 0
I0802 20:55:13.930429 1996 convolutional_keras_box_predictor.py:153] depth of additional conv before box predictor: 0
Traceback (most recent call last):
File "exporter_main_v2.py", line 165, in <module>
app.run(main)
File "E:\Anaconda\envs\TFOD\lib\site-packages\absl\app.py", line 312, in run
_run_main(main, args)
File "E:\Anaconda\envs\TFOD\lib\site-packages\absl\app.py", line 258, in _run_main
sys.exit(main(argv))
File "exporter_main_v2.py", line 158, in main
exporter_lib_v2.export_inference_graph(
File "E:\Anaconda\envs\TFOD\lib\site-packages\object_detection\exporter_lib_v2.py", line 271, in export_inference_graph
concrete_function = detection_module.__call__.get_concrete_function()
File "E:\Anaconda\envs\TFOD\lib\site-packages\tensorflow\python\eager\def_function.py", line 1299, in get_concrete_function
concrete = self._get_concrete_function_garbage_collected(*args, **kwargs)
File "E:\Anaconda\envs\TFOD\lib\site-packages\tensorflow\python\eager\def_function.py", line 1205, in _get_concrete_function_garbage_collected
self._initialize(args, kwargs, add_initializers_to=initializers)
File "E:\Anaconda\envs\TFOD\lib\site-packages\tensorflow\python\eager\def_function.py", line 725, in _initialize
self._stateful_fn._get_concrete_function_internal_garbage_collected( # pylint: disable=protected-access
File "E:\Anaconda\envs\TFOD\lib\site-packages\tensorflow\python\eager\function.py", line 2969, in _get_concrete_function_internal_garbage_collected
graph_function, _ = self._maybe_define_function(args, kwargs)
File "E:\Anaconda\envs\TFOD\lib\site-packages\tensorflow\python\eager\function.py", line 3361, in _maybe_define_function
graph_function = self._create_graph_function(args, kwargs)
File "E:\Anaconda\envs\TFOD\lib\site-packages\tensorflow\python\eager\function.py", line 3196, in _create_graph_function
func_graph_module.func_graph_from_py_func(
File "E:\Anaconda\envs\TFOD\lib\site-packages\tensorflow\python\framework\func_graph.py", line 990, in func_graph_from_py_func
func_outputs = python_func(*func_args, **func_kwargs)
File "E:\Anaconda\envs\TFOD\lib\site-packages\tensorflow\python\eager\def_function.py", line 634, in wrapped_fn
out = weak_wrapped_fn().__wrapped__(*args, **kwds)
File "E:\Anaconda\envs\TFOD\lib\site-packages\tensorflow\python\framework\func_graph.py", line 977, in wrapper
raise e.ag_error_metadata.to_exception(e)
tensorflow.python.autograph.pyct.error_utils.KeyError: in user code:
E:\Anaconda\envs\TFOD\lib\site-packages\object_detection\exporter_lib_v2.py:163 call_func *
return self._run_inference_on_images(images, true_shapes, **kwargs)
E:\Anaconda\envs\TFOD\lib\site-packages\object_detection\exporter_lib_v2.py:129 _run_inference_on_images *
detections[classes_field] = (
KeyError: 'detection_classes'
Found the solution!
In the config file, add number_of_stages: 1.
Instead of using exporter_main_v2.py, I wrote code that builds the model from the checkpoint file:
import os
import tensorflow as tf
from object_detection.utils import config_util
from object_detection.builders import model_builder

# Load pipeline config and build a detection model
configs = config_util.get_configs_from_pipeline_file(path_to_config)
model_config = configs['model']
detection_model = model_builder.build(model_config=model_config, is_training=False)

# Restore checkpoint
ckpt = tf.compat.v2.train.Checkpoint(model=detection_model)
ckpt.restore(os.path.join(path_to_ckpt, 'ckpt-0')).expect_partial()
Then I feed the image I want to inspect to the model and use object_detection.utils.visualization_utils.visualize_boxes_and_labels_on_image_array to draw the proposal boxes.
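As a rough illustration (not the asker's exact code), the inference and drawing step might look like the sketch below. It reuses detection_model from the snippet above; the image path, the dummy category_index, and the output filename are assumptions, and with number_of_stages: 1 the returned boxes are RPN proposals with no class labels:
import numpy as np
import tensorflow as tf
from PIL import Image
from object_detection.utils import visualization_utils as viz_utils

# Hypothetical test image; with number_of_stages: 1 the detections are RPN proposals.
image_np = np.array(Image.open('test_image.jpg'))
category_index = {1: {'id': 1, 'name': 'proposal'}}  # dummy label map for proposals

input_tensor = tf.convert_to_tensor(image_np[np.newaxis, ...], dtype=tf.float32)
preprocessed, shapes = detection_model.preprocess(input_tensor)
prediction_dict = detection_model.predict(preprocessed, shapes)
detections = detection_model.postprocess(prediction_dict, shapes)

boxes = detections['detection_boxes'][0].numpy()
scores = detections['detection_scores'][0].numpy()
classes = np.ones(boxes.shape[0], dtype=np.int32)  # proposals carry no class labels

viz_utils.visualize_boxes_and_labels_on_image_array(
    image_np, boxes, classes, scores, category_index,
    use_normalized_coordinates=True,
    max_boxes_to_draw=100,
    min_score_thresh=0.0)
Image.fromarray(image_np).save('proposals_on_image.jpg')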
I first constructed an RBM and tested it on a set of data; it worked well. Then I wrote a DBN with stacked RBMs and trained it on the same data. The program stopped with the following error when it tried to train the second RBM.
Traceback (most recent call last):
File "D:\Python\DL_DG\analysis\debug\debug_01_ppi.py", line 44, in <module>
ppi_dbn.fit(ppi_in)
File "D:/Python/DL_DG/Model\dbn_test.py", line 95, in fit
rbm.fit(input_data)
File "D:/Python/DL_DG/Model\rbm_test.py", line 295, in fit
self.partial_fit(batch_x, b, e)
File "D:/Python/DL_DG/Model\rbm_test.py", line 188, in partial_fit
feed_dict={self.x: batch_x})
File "C:\Users\pil562\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\client\session.py", line 895, in run
run_metadata_ptr)
File "C:\Users\pil562\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\client\session.py", line 1124, in _run
feed_dict_tensor, options, run_metadata)
File "C:\Users\pil562\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\client\session.py", line 1321, in _do_run
options, run_metadata)
File "C:\Users\pil562\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\client\session.py", line 1340, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: You must feed a value for placeholder tensor 'input/x' with dtype float and shape [?,128]
[[Node: input/x = Placeholder[dtype=DT_FLOAT, shape=[?,128], _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Caused by op 'input/x', defined at:
File "<string>", line 1, in <module>
File "C:\Users\pil562\AppData\Local\Programs\Python\Python36\lib\idlelib\run.py", line 142, in main
ret = method(*args, **kwargs)
File "C:\Users\pil562\AppData\Local\Programs\Python\Python36\lib\idlelib\run.py", line 460, in runcode
exec(code, self.locals)
File "D:\Python\DL_DG\analysis\debug\debug_01_ppi.py", line 42, in <module>
learning_rate_rbm=[0.001,0.01],rbm_gauss_visible=True)
File "D:/Python/DL_DG/Model\dbn_test.py", line 52, in __init__
sample_gauss_visible=self.sample_gauss_visible, sigma=self.sigma))
File "D:/Python/DL_DG/Model\rbm_test.py", line 358, in __init__
xavier_const,err_function,use_tqdm,tqdm)
File "D:/Python/DL_DG/Model\rbm_test.py", line 46, in __init__
self.x = tf.placeholder(tf.float32, [None, self.n_visible],name='x')
File "C:\Users\pil562\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\ops\array_ops.py", line 1548, in placeholder
return gen_array_ops._placeholder(dtype=dtype, shape=shape, name=name)
File "C:\Users\pil562\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\ops\gen_array_ops.py", line 2094, in _placeholder
name=name)
File "C:\Users\pil562\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 767, in apply_op
op_def=op_def)
File "C:\Users\pil562\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\framework\ops.py", line 2630, in create_op
original_op=self._default_original_op, op_def=op_def)
File "C:\Users\pil562\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\framework\ops.py", line 1204, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InvalidArgumentError (see above for traceback): You must feed a value for placeholder tensor 'input/x' with dtype float and shape [?,128]
[[Node: input/x = Placeholder[dtype=DT_FLOAT, shape=[?,128], _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
The error occurs at the following function:
def partial_fit(self, batch_x, k, j):
    print(batch_x.dtype, batch_x.shape)
    summary, _ = self.sess.run([self.merged, self.update_weights + self.update_deltas],
                               feed_dict={self.x: batch_x})
    self.train_writer.add_summary(summary, k*self.batch_size+j)
I printed the type and shape of batch_x. The shape stays the same during the whole training process. The type is float64 when training the first RBM and float32 when training the second RBM, which is where it stops and throws the error.
The DBN worked well when I didn't compute the summary and just used the following code:
self.sess.run(self.update_weights + self.update_deltas,feed_dict={self.x: batch_x})
It also worked well if I only train a single RBM (with or without the summary).
The batch_x used to train the second RBM is probabilities of the hidden layer in the first RBM.
Could somebody help me solve this problem? I'm not sure if the float64 is the problem.
I guess it's hard for anyone to solve the problem with only the two pieces of code I've posted; the full code is too long to include here.
I saved the output of the first RBM and used it as input to train another RBM, and that works well. So I think the problem is not the type or shape of the fed batch_x, but the structure of the DBN or the way I collect the summaries.
Hope my situation can help others with similar problems.
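One plausible explanation, offered only as a guess since the full code isn't shown: if every RBM calls tf.summary.merge_all(), the second RBM's merged op also pulls in the first RBM's summaries, and running it then demands a value for the first RBM's input/x placeholder. Scoping each RBM and merging only its own summaries avoids that. A minimal TF1-style sketch (the scope names and histogram summaries are illustrative):
import tensorflow as tf

g = tf.Graph()
with g.as_default():
    with tf.name_scope('rbm_1'):
        x1 = tf.placeholder(tf.float32, [None, 128], name='x')
        tf.summary.histogram('x_hist', x1)
    with tf.name_scope('rbm_2'):
        x2 = tf.placeholder(tf.float32, [None, 64], name='x')
        tf.summary.histogram('x_hist', x2)

    # Merge only the summaries created under this RBM's scope, instead of
    # tf.summary.merge_all(), which would also require feeding rbm_1's placeholder.
    merged_rbm2 = tf.summary.merge(
        tf.get_collection(tf.GraphKeys.SUMMARIES, scope='rbm_2'))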
In a task to implement minimum risk training for a neural machine translation system, I need to sample sentences and gather the corresponding logits for the sampled word IDs. The gathering step looks like this:
for i in range(1, self._num_of_samples):
    logits, _, _, sampled_ids = self.decoder._decoding_loop(train_mode=False, sample=True)
    ind = [[[tf.constant(i), tf.constant(j), sampled_ids[i][j]] for j in range(self.batch_size)]
           for i in range(self.decoder.max_output_len)]
    gathered_logits = tf.gather_nd(logits, ind)
    sentence_sum_logit = tf.reduce_sum(gathered_logits, 0)
    self.sample_sen_ids = self.sample_sen_ids.write(steps[i], sampled_ids)
    self.sample_logits = self.sample_logits.write(steps[i], sentence_sum_logit)
self.sample_sen_ids = tf.transpose(self.sample_sen_ids.stack())
self.sample_logits = tf.transpose(self.sample_logits.stack())
But I don't understand why, after some batches, I get this:
Traceback (most recent call last):
File "/home/stoyan/neurmon/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1022, in _do_call
return fn(*args)
File "/home/stoyan/neurmon/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1004, in _run_fn
status, run_metadata)
File "/usr/lib/python3.5/contextlib.py", line 66, in __exit__
next(self.gen)
File "/home/stoyan/neurmon/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: slice index 49 of dimension 0 out of bounds.
[[Node: sampling/strided_slice_4900 = StridedSlice[Index=DT_INT32, T=DT_INT32, begin_mask=0, ellipsis_mask=0, end_mask=0, new_axis_mask=0, shrink_axis_mask=1, _device="/job:localhost/replica:0/task:0/cpu:0"](sampling/TensorArrayStack_3/TensorArrayGatherV3, sampling/strided_slice_4900/stack, sampling/strided_slice_4900/stack_1, sampling/strided_slice_4900/stack_2)]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "bin/neuralmonkey-train", line 6, in <module>
main()
File "/home/stoyan/neuralmonkey/bin/neuralmonkey/train.py", line 211, in main
initial_variables=cfg.model.initial_variables)
File "/home/stoyan/neuralmonkey/bin/neuralmonkey/learning_utils.py", line 185, in training_loop
results, meta=tf_manager.execute(batch_dataset, [trainer],train=True, summaries=False)
File "/home/stoyan/neuralmonkey/bin/neuralmonkey/tf_manager.py", line 217, in execute
for sess in self.sessions]
File "/home/stoyan/neuralmonkey/bin/neuralmonkey/tf_manager.py", line 217, in <listcomp>
for sess in self.sessions]
File "/home/stoyan/neurmon/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 767, in run
run_metadata_ptr)
File "/home/stoyan/neurmon/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 965, in _run
feed_dict_string, options, run_metadata)
File "/home/stoyan/neurmon/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1015, in _do_run
target_list, options, run_metadata)
File "/home/stoyan/neurmon/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1035, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: slice index 49 of dimension 0 out of bounds.
[[Node: sampling/strided_slice_4900 = StridedSlice[Index=DT_INT32, T=DT_INT32, begin_mask=0, ellipsis_mask=0, end_mask=0, new_axis_mask=0, shrink_axis_mask=1, _device="/job:localhost/replica:0/task:0/cpu:0"](sampling/TensorArrayStack_3/TensorArrayGatherV3, sampling/strided_slice_4900/stack, sampling/strided_slice_4900/stack_1, sampling/strided_slice_4900/stack_2)]]
Caused by op 'sampling/strided_slice_4900', defined at:
File "bin/neuralmonkey-train", line 6, in <module>
main()
File "/home/stoyan/neuralmonkey/bin/neuralmonkey/train.py", line 170, in main
cfg.build_model(warn_unused=True)
File "/home/stoyan/neuralmonkey/bin/neuralmonkey/config/configuration.py", line 86, in build_model
model = build_config(self.config_dict, self.ignored, warn_unused)
File "/home/stoyan/neuralmonkey/bin/neuralmonkey/config/builder.py", line 198, in build_config
value, config_dicts, existing_objects, 0)
File "/home/stoyan/neuralmonkey/bin/neuralmonkey/config/builder.py", line 109, in build_object
obj = instantiate_class(value[7:], all_dicts, existing_objects, depth)
File "/home/stoyan/neuralmonkey/bin/neuralmonkey/config/builder.py", line 165, in instantiate_class
obj = clazz(*bounded_params.args, **bounded_params.kwargs)
File "/home/stoyan/neuralmonkey/bin/neuralmonkey/trainers/mrt_trainer.py", line 80, in __init__
ind=[[[tf.constant(i),tf.constant(j),sampled_ids[i][j]] for j in range(self.batch_size)] for i in range(self.decoder.max_output_len)]
File "/home/stoyan/neuralmonkey/bin/neuralmonkey/trainers/mrt_trainer.py", line 80, in <listcomp>
ind=[[[tf.constant(i),tf.constant(j),sampled_ids[i][j]] for j in range(self.batch_size)] for i in range(self.decoder.max_output_len)]
File "/home/stoyan/neuralmonkey/bin/neuralmonkey/trainers/mrt_trainer.py", line 80, in <listcomp>
ind=[[[tf.constant(i),tf.constant(j),sampled_ids[i][j]] for j in range(self.batch_size)] for i in range(self.decoder.max_output_len)]
File "/home/stoyan/neurmon/lib/python3.5/site-packages/tensorflow/python/ops/array_ops.py", line 495, in _SliceHelper
name=name)
File "/home/stoyan/neurmon/lib/python3.5/site-packages/tensorflow/python/ops/array_ops.py", line 653, in strided_slice
shrink_axis_mask=shrink_axis_mask)
File "/home/stoyan/neurmon/lib/python3.5/site-packages/tensorflow/python/ops/gen_array_ops.py", line 3688, in strided_slice
shrink_axis_mask=shrink_axis_mask, name=name)
File "/home/stoyan/neurmon/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
op_def=op_def)
File "/home/stoyan/neurmon/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2327, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/home/stoyan/neurmon/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1226, in __init__
self._traceback = _extract_stack()
InvalidArgumentError (see above for traceback): slice index 49 of dimension 0 out of bounds.
[[Node: sampling/strided_slice_4900 = StridedSlice[Index=DT_INT32, T=DT_INT32, begin_mask=0, ellipsis_mask=0, end_mask=0, new_axis_mask=0, shrink_axis_mask=1, _device="/job:localhost/replica:0/task:0/cpu:0"](sampling/TensorArrayStack_3/TensorArrayGatherV3, sampling/strided_slice_4900/stack, sampling/strided_slice_4900/stack_1, sampling/strided_slice_4900/stack_2)]]
What does this InvalidArgumentError refer to, and what is going wrong?
Best,
Stoyan
According to the stack trace, the error comes from this expression in your code:
sampled_ids[i][j]
...but it's hard to tell without more context whether it comes from taking the [i] slice or the [j] slice. Presumably one of the tensors in this structure has fewer than 15 (or 49, per the error message) elements in the 0th dimension. This can often happen if your input data includes word IDs that are not present in the vocabulary used for training the model.
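If the out-of-bounds index comes from building ind with Python range() over self.batch_size and self.decoder.max_output_len while the runtime tensors are shorter, one way to sidestep it is to build the gather indices from the tensors' dynamic shapes. This is only a sketch under that assumption, not a drop-in fix for Neural Monkey:
import tensorflow as tf

def gather_sampled_logits(logits, sampled_ids):
    """logits: [time, batch, vocab]; sampled_ids: [time, batch] integer word IDs."""
    sampled_ids = tf.cast(sampled_ids, tf.int32)
    time_steps = tf.shape(sampled_ids)[0]
    batch = tf.shape(sampled_ids)[1]
    # Index grids that always match the tensors' actual runtime sizes.
    t_idx, b_idx = tf.meshgrid(tf.range(time_steps), tf.range(batch), indexing='ij')
    ind = tf.stack([t_idx, b_idx, sampled_ids], axis=-1)   # [time, batch, 3]
    gathered_logits = tf.gather_nd(logits, ind)            # [time, batch]
    return tf.reduce_sum(gathered_logits, 0)               # per-sentence sum of logits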
I'm using the following code to log accuracy as the validation measure (TensorFlow 0.10):
validation_metrics = {"accuracy": tf.contrib.metrics.streaming_accuracy}
validation_monitor = tf.contrib.learn.monitors.ValidationMonitor(
    input_fn=input_fn_eval,
    every_n_steps=FLAGS.eval_every,
    # metrics=validation_metrics,
    early_stopping_rounds=500,
    early_stopping_metric="loss",
    early_stopping_metric_minimize=True)
After running, at each validation (every every_n_steps steps) I see the following lines in the output:
INFO:tensorflow:Validation (step 1000): loss = 1.04875, global_step = 900
The problem is that when I uncomment the metrics=validation_metrics parameter in the code above, I get the following error during the validation phase:
INFO:tensorflow:Error reported to Coordinator: <type 'exceptions.TypeError'>, Input 'y' of 'Equal' Op has type int64 that does not match type float32 of argument 'x'.
E tensorflow/core/client/tensor_c_api.cc:485] Enqueue operation was cancelled
[[Node: read_batch_features_train/file_name_queue/file_name_queue_EnqueueMany = QueueEnqueueMany[Tcomponents=[DT_STRING], _class=["loc:#read_batch_features_train/file_name_queue"], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](read_batch_features_train/file_name_queue, read_batch_features_train/file_name_queue/RandomShuffle)]]
E tensorflow/core/client/tensor_c_api.cc:485] Enqueue operation was cancelled
[[Node: read_batch_features_train/random_shuffle_queue_EnqueueMany = QueueEnqueueMany[Tcomponents=[DT_STRING, DT_STRING], _class=["loc:#read_batch_features_train/random_shuffle_queue"], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](read_batch_features_train/random_shuffle_queue, read_batch_features_train/read/ReaderReadUpTo, read_batch_features_train/read/ReaderReadUpTo:1)]]
Traceback (most recent call last):
File "udc_train.py", line 74, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv))
File "udc_train.py", line 70, in main
estimator.fit(input_fn=input_fn_train, steps=None, monitors=[validation_monitor])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 240, in fit
max_steps=max_steps)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 578, in _train_model
max_steps=max_steps)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/graph_actions.py", line 280, in _supervised_train
None)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/supervised_session.py", line 270, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/recoverable_session.py", line 54, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/coordinated_session.py", line 70, in run
self._coord.join(self._coordinated_threads_to_join)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 357, in join
six.reraise(*self._exc_info_to_raise)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/coordinated_session.py", line 66, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/monitored_session.py", line 107, in run
induce_stop = monitor.step_end(monitors_step, monitor_outputs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/monitors.py", line 396, in step_end
return self.every_n_step_end(step, output)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/monitors.py", line 687, in every_n_step_end
steps=self.eval_steps, metrics=self.metrics, name=self.name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 356, in evaluate
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 630, in _evaluate_model
eval_dict = self._get_eval_ops(features, targets, metrics)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 877, in _get_eval_ops
result[name] = metric(predictions, targets)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/metrics/python/ops/metric_ops.py", line 432, in streaming_accuracy
is_correct = math_ops.to_float(math_ops.equal(predictions, labels))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 708, in equal
result = _op_def_lib.apply_op("Equal", x=x, y=y, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 468, in apply_op
inferred_from[input_arg.type_attr]))
TypeError: Input 'y' of 'Equal' Op has type int64 that does not match type float32 of argument 'x'.
This looks like a mismatch between your input_fn and your estimator: the labels returned by input_fn (int64) don't have the same type as the predictions produced by the model (float32).
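For illustration, one way to make the types line up is to wrap the metric and cast the labels to the predictions' dtype. This is a sketch under the assumption that the labels arrive as int64 and the predictions as float32; whether a plain cast is the right semantics depends on what your model's prediction op actually returns (probabilities may need to be thresholded or argmaxed first):
import tensorflow as tf

def accuracy_with_cast(predictions, labels):
    # Cast the labels to the predictions' dtype so the Equal op inside
    # streaming_accuracy compares matching types.
    return tf.contrib.metrics.streaming_accuracy(
        predictions, tf.cast(labels, predictions.dtype))

validation_metrics = {"accuracy": accuracy_with_cast}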
I have a checkpoint that was saved by graph code in a regular, non-distributed setup using with tf.device('/cpu:0'): (to force the model parameters to reside on the CPU instead of the GPU).
Now I have converted the same code/graph to a distributed setting, following the guidelines in TF-Inception.
When I try to restore the checkpoint in the distributed setup, I get device-mismatch errors. Is there a way to override the requirements saved in the checkpoint file, or something along those lines?
My new distributed code has the Saver and scopes defined as:
if FLAGS.job_name == 'worker':
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % FLAGS.task_id,
            cluster=cluster_spec)):
        # ...same network-graph code... #
        restorer = tf.train.Saver()
    with tf.Session() as sess:
        restorer.restore(sess, 'ResNet-L50.ckpt')
My cluster has one ps and one worker, and both are on localhost. Error line:
tensorflow.python.framework.errors.InvalidArgumentError: Cannot assign a device to node 'save/restore_slice_268/shape_and_slice': Could not satisfy explicit device specification '/job:ps/task:0/device:CPU:0' because no devices matching that specification are registered in this process; available devices: /job:localhost/replica:0/task:0/cpu:0, /job:localhost/replica:0/task:0/gpu:0
[[Node: save/restore_slice_268/shape_and_slice = Const[dtype=DT_STRING, value=Tensor<type: string shape: [] values: >, _device="/job:ps/task:0/device:CPU:0"]()]]
Full error trace:
I tensorflow/core/common_runtime/gpu/gpu_device.cc:756] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Quadro K2200, pci bus id: 0000:01:00.0)
Traceback (most recent call last):
File "dlaunch.py", line 85, in <module>
tf.app.run() # (tf.app.flags parsed here)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv))
File "dlaunch.py", line 81, in main
dtrainer.train(server.target, cluster_spec)
File "/home/muneeb/parkingtf/dtrainer.py", line 88, in train
restorer.restore(sess, 'ResNet-L50.ckpt')
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1103, in restore
{self.saver_def.filename_tensor_name: save_path})
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 328, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 563, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 636, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 658, in _do_call
e.code)
tensorflow.python.framework.errors.InvalidArgumentError: Cannot assign a device to node 'save/restore_slice_268/shape_and_slice': Could not satisfy explicit device specification '/job:ps/task:0/device:CPU:0' because no devices matching that specification are registered in this process; available devices: /job:localhost/replica:0/task:0/cpu:0, /job:localhost/replica:0/task:0/gpu:0
[[Node: save/restore_slice_268/shape_and_slice = Const[dtype=DT_STRING, value=Tensor<type: string shape: [] values: >, _device="/job:ps/task:0/device:CPU:0"]()]]
Caused by op u'save/restore_slice_268/shape_and_slice', defined at:
File "dlaunch.py", line 85, in <module>
tf.app.run() # (tf.app.flags parsed here)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv))
File "dlaunch.py", line 81, in main
dtrainer.train(server.target, cluster_spec)
File "/home/muneeb/parkingtf/dtrainer.py", line 86, in train
restorer = tf.train.Saver()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 845, in __init__
restore_sequentially=restore_sequentially)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 515, in build
filename_tensor, vars_to_save, restore_sequentially, reshape)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 271, in _AddRestoreOps
values = self.restore_op(filename_tensor, vs, preferred_shard)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 186, in restore_op
preferred_shard=preferred_shard)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/io_ops.py", line 201, in _restore_slice
preferred_shard, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 271, in _restore_slice
preferred_shard=preferred_shard, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/op_def_library.py", line 444, in apply_op
as_ref=input_arg.is_ref)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 566, in convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/constant_op.py", line 179, in _constant_tensor_conversion_function
return constant(v, dtype=dtype, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/constant_op.py", line 166, in constant
attrs={"value": tensor_value, "dtype": dtype_value}, name=name).outputs[0]
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2162, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1161, in __init__
self._traceback = _extract_stack()
The following line:
with tf.Session() as sess:
...is responsible for the error. Passing no arguments to tf.Session() creates an in-process session that can only use the devices on the local machine. To work in distributed mode, you should have something like:
# Assuming you created `server = tf.train.Server(...)` earlier.
with tf.Session(server.target) as sess:
...or, if you are connecting to a different process:
# Assuming your server is in a different process.
with tf.Session("grpc://..."):
Note that the devices are not stored in the checkpoint file; they are added by tf.train.replica_device_setter(). Device configuration is a bit tricky right now, and it's something that we're working to simplify.
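Putting the pieces together, here is a condensed, self-contained sketch of the intended pattern. The cluster addresses and the dummy variable are illustrative stand-ins; the checkpoint name and overall structure follow the question's worker code:
import tensorflow as tf

# Illustrative single-machine cluster; the ports are arbitrary.
cluster_spec = tf.train.ClusterSpec({'ps': ['localhost:2222'],
                                     'worker': ['localhost:2223']})
job_name, task_id = 'worker', 0   # would normally come from FLAGS

server = tf.train.Server(cluster_spec, job_name=job_name, task_index=task_id)

if job_name == 'ps':
    server.join()
else:
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % task_id,
            cluster=cluster_spec)):
        # ... same network-graph code as before (a dummy variable stands in here) ...
        _ = tf.Variable(tf.zeros([1]), name='dummy')
        restorer = tf.train.Saver()

    # Connecting to server.target (rather than creating an in-process session)
    # makes the /job:ps and /job:worker devices visible to the restore ops.
    with tf.Session(server.target) as sess:
        restorer.restore(sess, 'ResNet-L50.ckpt')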