tensorflow CTC runtime error - tensorflow

I met a runtime error during training when I tried to applied tensorflow built-in CTC loss function (https://www.tensorflow.org/versions/r0.10/api_docs/python/nn/conectionist_temporal_classification__ctc_) to SynthText dataset. http://www.robots.ox.ac.uk/~vgg/data/scenetext/
It said " Not enough time for target transition sequence (required: 4, available: 0)".
Here is some info for the environment: tensorflow version '0.12.0-rc0'.
I am able to apply tensorflow built-in CTC to Synth90K Dataset with great performance.
It seems like the SynthText dataset is not compilable with tensorflow built-in CTC but Synth90K Dataset could.
Please find the error message as reference
step 980, loss = 50.17 (92.1 examples/sec; 0.695 sec/batch)
W tensorflow/core/framework/op_kernel.cc:975] Invalid argument: Not enough time for target transition sequence (required: 4, available: 0), skipping data instance in batch: 28
Traceback (most recent call last):
File "multi-gpu-train.py", line 305, in
tf.app.run()
File "/home/ubuntu/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 43, in run
sys.exit(main(sys.argv[:1] + flags_passthrough))
File "multi-gpu-train.py", line 301, in main
train()
File "multi-gpu-train.py", line 270, in train
_, loss_value = sess.run([train_op, loss])
File "/home/ubuntu/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 766, in run
run_metadata_ptr)
File "/home/ubuntu/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 964, in _run
feed_dict_string, options, run_metadata)
File "/home/ubuntu/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1014, in _do_run
target_list, options, run_metadata)
File "/home/ubuntu/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1034, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Not enough time for target transition sequence (required: 4, available: 0), skipping data instance in batch: 28
[[Node: tower_0/CTCLoss = CTCLoss[ctc_merge_repeated=true, preprocess_collapse_repeated=false, _device="/job:localhost/replica:0/task:0/cpu:0"](tower_0/transpose_2/_555, tower_0/Where, tower_0/sub_2/_557, tower_0/Sum_1/_559)]]
Caused by op u'tower_0/CTCLoss', defined at:
File "multi-gpu-train.py", line 305, in
tf.app.run()
File "/home/ubuntu/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 43, in run
sys.exit(main(sys.argv[:1] + flags_passthrough))
File "multi-gpu-train.py", line 301, in main
train()
File "multi-gpu-train.py", line 179, in train
loss,logits_op,images,labels = tower_loss(scope)
File "multi-gpu-train.py", line 79, in tower_loss
_ = network2.loss(logits,images, labels)
File "/home/ubuntu/experiments/network2_dev/network2.py", line 61, in loss
out = tf.nn.ctc_loss(logit, to_sparse(y), seq_len, time_major=False)
File "/home/ubuntu/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/ctc_ops.py", line 145, in ctc_loss
ctc_merge_repeated=ctc_merge_repeated)
File "/home/ubuntu/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_ctc_ops.py", line 164, in _ctc_loss
name=name)
File "/home/ubuntu/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 759, in apply_op
op_def=op_def)
File "/home/ubuntu/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2240, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/home/ubuntu/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1128, in init
self._traceback = _extract_stack()
InvalidArgumentError (see above for traceback): Not enough time for target transition sequence (required: 4, available: 0), skipping data instance in batch: 28
[[Node: tower_0/CTCLoss = CTCLoss[ctc_merge_repeated=true, preprocess_collapse_repeated=false, _device="/job:localhost/replica:0/task:0/cpu:0"](tower_0/transpose_2/_555, tower_0/Where, tower_0/sub_2/_557, tower_0/Sum_1/_559)]]

Related

Tensorboard exception with summary.image of shape [-1, 125, 128, 1] of MFCCs

Following this guide, I'm converting a tensor [batch_size, 16000, 1] to an MFCC using the method described in the link:
def gen_spectrogram(wav, sr=16000):
# A 1024-point STFT with frames of 64 ms and 75% overlap.
stfts = tf.contrib.signal.stft(wav, frame_length=1024, frame_step=256, fft_length=1024)
spectrograms = tf.abs(stfts)
# Warp the linear scale spectrograms into the mel-scale.
num_spectrogram_bins = stfts.shape[-1].value
lower_edge_hertz, upper_edge_hertz, num_mel_bins = 80.0, 7600.0, 80
linear_to_mel_weight_matrix = tf.contrib.signal.linear_to_mel_weight_matrix(
num_mel_bins, num_spectrogram_bins,
sample_rate, lower_edge_hertz, upper_edge_hertz)
mel_spectrograms = tf.tensordot(spectrograms, linear_to_mel_weight_matrix, 1)
mel_spectrograms.set_shape(
spectrograms.shape[:-1].concatenate(
linear_to_mel_weight_matrix.shape[-1:]
)
)
# Compute a stabilized log to get log-magnitude mel-scale spectrograms.
log_mel_spectrograms = tf.log(mel_spectrograms + 1e-6)
# Compute MFCCs from log_mel_spectrograms and take the first 13.
return tf.contrib.signal.mfccs_from_log_mel_spectrograms(log_mel_spectrograms)[..., :13]
I then reshape the output of that to [batch_size, 125, 128, 1]. If I send that to a tf.layers.conv2d, things seem to work fine. However, if I try to tf.summary.image, I get the following error:
print(spec)
// => Tensor("spectrogram/Reshape:0", shape=(?, 125, 128, 1), dtype=float32)
tf.summary.image('spec', spec)
Caused by op u'spectrogram/stft/rfft', defined at:
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 162, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/Users/rsilveira/rnd/ml-engine/trainer/flatv1.py", line 103, in <module>
runner.run(model_fn)
File "trainer/runner.py", line 88, in run
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
File "/Library/Python/2.7/site-packages/tensorflow/python/estimator/training.py", line 432, in train_and_evaluate
executor.run_local()
File "/Library/Python/2.7/site-packages/tensorflow/python/estimator/training.py", line 611, in run_local
hooks=train_hooks)
File "/Library/Python/2.7/site-packages/tensorflow/python/estimator/estimator.py", line 302, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/Library/Python/2.7/site-packages/tensorflow/python/estimator/estimator.py", line 711, in _train_model
features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
File "/Library/Python/2.7/site-packages/tensorflow/python/estimator/estimator.py", line 694, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/Users/rsilveira/rnd/ml-engine/trainer/flatv1.py", line 53, in model_fn
spec = gen_spectrogram(x)
File "/Users/rsilveira/rnd/ml-engine/trainer/flatv1.py", line 22, in gen_spectrogram
step,
File "/Library/Python/2.7/site-packages/tensorflow/contrib/signal/python/ops/spectral_ops.py", line 91, in stft
return spectral_ops.rfft(framed_signals, [fft_length])
File "/Library/Python/2.7/site-packages/tensorflow/python/ops/spectral_ops.py", line 136, in _rfft
return fft_fn(input_tensor, fft_length, name)
File "/Library/Python/2.7/site-packages/tensorflow/python/ops/gen_spectral_ops.py", line 619, in rfft
"RFFT", input=input, fft_length=fft_length, name=name)
File "/Library/Python/2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/Library/Python/2.7/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
op_def=op_def)
File "/Library/Python/2.7/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InvalidArgumentError (see above for traceback): Input dimension 4 must have length of at least 512 but got: 320
Not sure where to start troubleshooting this. What am I missing here?

Nan in summary histogram for: FirstStageFeatureExtractor

I am trying to run the tensorflow object detection api. I have made a dataset with 1 class using labelImg and then converting the xmls to tfrecord files. Some information:
os: Ubuntu 16.04
gpu: nvidia geforce 1080Ti & 1060
tensorflow version: 1.3.0
training model: faster_rcnn_resnet101_coco (although I have tried others)
Classes: 1
I then run train.py and training begins. When I get ~step 355 I get the error:
INFO:tensorflow:global step 363: loss = 1.4006 (0.294 sec/step)
INFO:tensorflow:Finished training! Saving model to disk.
Traceback (most recent call last):
File "object_detection/train.py", line 163, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "object_detection/train.py", line 159, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/home/ucfadng/tensorflow/models/research/object_detection/trainer.py", line 332, in train
saver=saver)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 767, in train
sv.stop(threads, close_summary_writer=True)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 792, in stop
stop_grace_period_secs=self._stop_grace_secs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 296, in stop_on_exception
yield
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 494, in run
self.run_loop()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 994, in run_loop
self._sv.global_step])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 895, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1124, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1321, in _do_run
options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1340, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Nan in summary histogram for: SecondStageFeatureExtractor/resnet_v1_101/block4/unit_3/bottleneck_v1/conv3/BatchNorm/gamma_1
[[Node: SecondStageFeatureExtractor/resnet_v1_101/block4/unit_3/bottleneck_v1/conv3/BatchNorm/gamma_1 = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](SecondStageFeatureExtractor/resnet_v1_101/block4/unit_3/bottleneck_v1/conv3/BatchNorm/gamma_1/tag, SecondStageFeatureExtractor/resnet_v1_101/block4/unit_3/bottleneck_v1/conv3/BatchNorm/gamma/read)]]
[[Node: Loss/RPNLoss/map/TensorArray_2/_1353 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_5239_Loss/RPNLoss/map/TensorArray_2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
Caused by op u'SecondStageFeatureExtractor/resnet_v1_101/block4/unit_3/bottleneck_v1/conv3/BatchNorm/gamma_1', defined at:
File "object_detection/train.py", line 163, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "object_detection/train.py", line 159, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/home/ucfadng/tensorflow/models/research/object_detection/trainer.py", line 295, in train
global_summaries.add(tf.summary.histogram(model_var.op.name, model_var))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/summary/summary.py", line 192, in histogram
tag=tag, values=values, name=scope)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_logging_ops.py", line 129, in _histogram_summary
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1204, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InvalidArgumentError (see above for traceback): Nan in summary histogram for: SecondStageFeatureExtractor/resnet_v1_101/block4/unit_3/bottleneck_v1/conv3/BatchNorm/gamma_1
[[Node: SecondStageFeatureExtractor/resnet_v1_101/block4/unit_3/bottleneck_v1/conv3/BatchNorm/gamma_1 = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](SecondStageFeatureExtractor/resnet_v1_101/block4/unit_3/bottleneck_v1/conv3/BatchNorm/gamma_1/tag, SecondStageFeatureExtractor/resnet_v1_101/block4/unit_3/bottleneck_v1/conv3/BatchNorm/gamma/read)]]
[[Node: Loss/RPNLoss/map/TensorArray_2/_1353 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_5239_Loss/RPNLoss/map/TensorArray_2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
My label map is:
item {
id: 1
name: 'rail'
}
The pipeline used was downloaded from: https://github.com/tensorflow/models/blob/master/research/object_detection/samples/configs/faster_rcnn_resnet101_coco.config
I have had a look at the TFRecord files and look as expected which leads be to think this is something to do with the initial dataset. But what specifically would cause this error?
My dataset consists of 250 clear images containing railway tracks. I have labelled sections of the tracks so each image has ~20/30 labelled objects.
So far the main thing I have tried is lowering learning rates and changing batch size but these did not solve the problem.
Any help in solving this would be very appreciated.
Cheers
I have now updated to version: 1.4.0 and this is no longer a problem. I did not change anything else. Perhaps this was a bug with version: 1.3.0. I will leave this up just in case anyone has the same issue.
This doesn't seem to be an error with tensorflow itself, i believe it has something to do with your system running out of memory.
Setting train to false in batch_norm (within your pipeline.config file) appears to overcome this problem.
It should look something like this:
batch_norm {
decay: 0.999
center: true
scale: true
epsilon: 0.001
train: false
}
Hope this helped.

im2txt UnimplementedError (see above for traceback): TensorArray has size zero when run Training when changing new data

I got an error when I changed new images to train the im2txt model. Don't know why.
Build the model.
bazel build -c opt im2txt/...
bazel-bin/im2txt/train
--input_file_pattern="${MY_DATA_DIR}/train-?????-of-00256"
--inception_checkpoint_file="${INCEPTION_CHECKPOINT}"
--train_dir="${MODEL_DIR}/train"
--train_inception=false
--number_of_steps=10000
It went to error when running below sentence
sequence_length = tf.reduce_sum(self.input_mask, 1)
lstm_outputs, _ = tf.nn.dynamic_rnn(cell=lstm_cell,
inputs=self.seq_embeddings,
sequence_length=sequence_length,
initial_state=initial_state,
dtype=tf.float32,
scope=lstm_scope)
The detail info is below
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:global step 1: loss = 9.5415 (37.21 sec/step)
INFO:tensorflow:global step 2: loss = 6.6332 (12.90 sec/step)
INFO:tensorflow:global step 3: loss = 3.1327 (13.01 sec/step)
INFO:tensorflow:global step 4: loss = 6.2893 (12.04 sec/step)
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.UnimplementedError'>, TensorArray has size zero, but element shape is not fully defined. Currently only static shapes are supported when packing zero-size TensorArrays.
[[Node: OptimizeLoss/gradients/lstm/lstm/TensorArrayUnstack/TensorArrayScatter/TensorArrayScatterV3_grad/TensorArrayGatherV3 = TensorArrayGatherV3[_class=["loc:#lstm/lstm/TensorArray_1"], dtype=DT_FLOAT, element_shape=, _device="/job:localhost/replica:0/task:0/cpu:0"](OptimizeLoss/gradients/lstm/lstm/TensorArrayUnstack/TensorArrayScatter/TensorArrayScatterV3_grad/TensorArrayGrad/TensorArrayGradV3, lstm/lstm/TensorArrayUnstack/range, OptimizeLoss/gradients/lstm/lstm/TensorArrayUnstack/TensorArrayScatter/TensorArrayScatterV3_grad/TensorArrayGrad/gradient_flow)]]
Caused by op u'OptimizeLoss/gradients/lstm/lstm/TensorArrayUnstack/TensorArrayScatter/TensorArrayScatterV3_grad/TensorArrayGatherV3', defined at:
File "/data/projects/content_creator/image2text/im2txt/bazel-bin/im2txt/train.runfiles/im2txt/im2txt/train.py", line 155, in
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 44, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/data/projects/content_creator/image2text/im2txt/bazel-bin/im2txt/train.runfiles/im2txt/im2txt/train.py", line 135, in main
learning_rate_decay_fn=learning_rate_decay_fn)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/layers/python/layers/optimizers.py", line 226, in optimize_loss
colocate_gradients_with_ops=colocate_gradients_with_ops)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 345, in compute_gradients
colocate_gradients_with_ops=colocate_gradients_with_ops)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients_impl.py", line 482, in gradients
in_grads = grad_fn(op, *out_grads)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/tensor_array_grad.py", line 186, in _TensorArrayScatterGrad
grad = g.gather(indices)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/tensor_array_ops.py", line 328, in gather
element_shape=element_shape)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_data_flow_ops.py", line 2226, in _tensor_array_gather_v3
element_shape=element_shape, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2327, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1226, in init
self._traceback = _extract_stack()
...which was originally created as op u'lstm/lstm/TensorArrayUnstack/TensorArrayScatter/TensorArrayScatterV3', defined at:
File "/data/projects/content_creator/image2text/im2txt/bazel-bin/im2txt/train.runfiles/im2txt/im2txt/train.py", line 155, in
tf.app.run()
[elided 0 identical lines from previous traceback]
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 44, in run
_sys.exit(main(sys.argv[:1] + flags_passthrough))
File "/data/projects/content_creator/image2text/im2txt/bazel-bin/im2txt/train.runfiles/im2txt/im2txt/train.py", line 89, in main
model.build()
File "/data/projects/content_creator/image2text/im2txt/im2txt/show_and_tell_model.py", line 437, in build
self.build_model()
File "/data/projects/content_creator/image2text/im2txt/im2txt/show_and_tell_model.py", line 356, in build_model
scope=lstm_scope)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/rnn.py", line 546, in dynamic_rnn
dtype=dtype)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/rnn.py", line 664, in dynamic_rnn_loop
for ta, input in zip(input_ta, flat_input))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/rnn.py", line 664, in
for ta, input in zip(input_ta, flat_input))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/tensor_array_ops.py", line 380, in unstack
indices=math_ops.range(0, num_elements), value=value, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/tensor_array_ops.py", line 408, in scatter
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_data_flow_ops.py", line 2492, in _tensor_array_scatter_v3
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
op_def=op_def)
UnimplementedError (see above for traceback): TensorArray has size zero, but element shape is not fully defined. Currently only static shapes are supported when packing zero-size TensorArrays.
[[Node: OptimizeLoss/gradients/lstm/lstm/TensorArrayUnstack/TensorArrayScatter/TensorArrayScatterV3_grad/TensorArrayGatherV3 = TensorArrayGatherV3[_class=["loc:#lstm/lstm/TensorArray_1"], dtype=DT_FLOAT, element_shape=, _device="/job:localhost/replica:0/task:0/cpu:0"](OptimizeLoss/gradients/lstm/lstm/TensorArrayUnstack/TensorArrayScatter/TensorArrayScatterV3_grad/TensorArrayGrad/TensorArrayGradV3, lstm/lstm/TensorArrayUnstack/range, OptimizeLoss/gradients/lstm/lstm/TensorArrayUnstack/TensorArrayScatter/TensorArrayScatterV3_grad/TensorArrayGrad/gradient_flow)]]
Traceback (most recent call last):
File "/data/projects/content_creator/image2text/im2txt/bazel-bin/im2txt/train.runfiles/im2txt/im2txt/train.py", line 155, in
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 44, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/data/projects/content_creator/image2text/im2txt/bazel-bin/im2txt/train.runfiles/im2txt/im2txt/train.py", line 152, in main
saver=saver)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 793, in train
train_step_kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 530, in train_step
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 767, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 965, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1015, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1035, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnimplementedError: TensorArray has size zero, but element shape is not fully defined. Currently only static shapes are supported when packing zero-size TensorArrays.
[[Node: OptimizeLoss/gradients/lstm/lstm/TensorArrayUnstack/TensorArrayScatter/TensorArrayScatterV3_grad/TensorArrayGatherV3 = TensorArrayGatherV3[_class=["loc:#lstm/lstm/TensorArray_1"], dtype=DT_FLOAT, element_shape=, _device="/job:localhost/replica:0/task:0/cpu:0"](OptimizeLoss/gradients/lstm/lstm/TensorArrayUnstack/TensorArrayScatter/TensorArrayScatterV3_grad/TensorArrayGrad/TensorArrayGradV3, lstm/lstm/TensorArrayUnstack/range, OptimizeLoss/gradients/lstm/lstm/TensorArrayUnstack/TensorArrayScatter/TensorArrayScatterV3_grad/TensorArrayGrad/gradient_flow)]]
Caused by op u'OptimizeLoss/gradients/lstm/lstm/TensorArrayUnstack/TensorArrayScatter/TensorArrayScatterV3_grad/TensorArrayGatherV3', defined at:
File "/data/projects/content_creator/image2text/im2txt/bazel-bin/im2txt/train.runfiles/im2txt/im2txt/train.py", line 155, in
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 44, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/data/projects/content_creator/image2text/im2txt/bazel-bin/im2txt/train.runfiles/im2txt/im2txt/train.py", line 135, in main
learning_rate_decay_fn=learning_rate_decay_fn)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/layers/python/layers/optimizers.py", line 226, in optimize_loss
colocate_gradients_with_ops=colocate_gradients_with_ops)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 345, in compute_gradients
colocate_gradients_with_ops=colocate_gradients_with_ops)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients_impl.py", line 482, in gradients
in_grads = grad_fn(op, *out_grads)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/tensor_array_grad.py", line 186, in _TensorArrayScatterGrad
grad = g.gather(indices)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/tensor_array_ops.py", line 328, in gather
element_shape=element_shape)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_data_flow_ops.py", line 2226, in _tensor_array_gather_v3
element_shape=element_shape, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2327, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1226, in init
self._traceback = _extract_stack()
...which was originally created as op u'lstm/lstm/TensorArrayUnstack/TensorArrayScatter/TensorArrayScatterV3', defined at:
File "/data/projects/content_creator/image2text/im2txt/bazel-bin/im2txt/train.runfiles/im2txt/im2txt/train.py", line 155, in
tf.app.run()
[elided 0 identical lines from previous traceback]
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 44, in run
_sys.exit(main(sys.argv[:1] + flags_passthrough))
File "/data/projects/content_creator/image2text/im2txt/bazel-bin/im2txt/train.runfiles/im2txt/im2txt/train.py", line 89, in main
model.build()
File "/data/projects/content_creator/image2text/im2txt/im2txt/show_and_tell_model.py", line 437, in build
self.build_model()
File "/data/projects/content_creator/image2text/im2txt/im2txt/show_and_tell_model.py", line 356, in build_model
scope=lstm_scope)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/rnn.py", line 546, in dynamic_rnn
dtype=dtype)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/rnn.py", line 664, in dynamic_rnn_loop
for ta, input in zip(input_ta, flat_input))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/rnn.py", line 664, in
for ta, input in zip(input_ta, flat_input))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/tensor_array_ops.py", line 380, in unstack
indices=math_ops.range(0, num_elements), value=value, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/tensor_array_ops.py", line 408, in scatter
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_data_flow_ops.py", line 2492, in _tensor_array_scatter_v3
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
op_def=op_def)
UnimplementedError (see above for traceback): TensorArray has size zero, but element shape is not fully defined. Currently only static shapes are supported when packing zero-size TensorArrays.
[[Node: OptimizeLoss/gradients/lstm/lstm/TensorArrayUnstack/TensorArrayScatter/TensorArrayScatterV3_grad/TensorArrayGatherV3 = TensorArrayGatherV3[_class=["loc:#lstm/lstm/TensorArray_1"], dtype=DT_FLOAT, element_shape=, _device="/job:localhost/replica:0/task:0/cpu:0"](OptimizeLoss/gradients/lstm/lstm/TensorArrayUnstack/TensorArrayScatter/TensorArrayScatterV3_grad/TensorArrayGrad/TensorArrayGradV3, lstm/lstm/TensorArrayUnstack/range, OptimizeLoss/gradients/lstm/lstm/TensorArrayUnstack/TensorArrayScatter/TensorArrayScatterV3_grad/TensorArrayGrad/gradient_flow)]]

Continue training after keyboard interrupt?

I was training tensorflow and then i mashed the keyboard for shits and giggles:
INFO:tensorflow:global step 101: loss = 5.1761 (52.61 sec/step)
INFO:tensorflow:global step 102: loss = 4.8679 (18.78 sec/step)
INFO:tensorflow:global step 103: loss = 4.9662 (19.02 sec/step)
INFO:tensorflow:global step 104: loss = 5.1126 (17.36 sec/step)
^C^X^C^[^[^[^[^[
exit
Traceback (most recent call last):
File "/Users/kristoffer/web/im2txt/im2txt/bazel-bin/im2txt/train.runfiles/im2txt/im2txt/train.py", line 114, in <module>
tf.app.run()
File "/Library/Python/2.7/site-packages/tensorflow/python/platform/app.py", line 43, in run
sys.exit(main(sys.argv[:1] + flags_passthrough))
File "/Users/kristoffer/web/im2txt/im2txt/bazel-bin/im2txt/train.runfiles/im2txt/im2txt/train.py", line 110, in main
saver=saver)
File "/Library/Python/2.7/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 782, in train
sess, train_op, global_step, train_step_kwargs)
File "/Library/Python/2.7/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 530, in train_step
run_metadata=run_metadata)
File "/Library/Python/2.7/site-packages/tensorflow/python/client/session.py", line 766, in run
run_metadata_ptr)
File "/Library/Python/2.7/site-packages/tensorflow/python/client/session.py", line 964, in _run
feed_dict_string, options, run_metadata)
File "/Library/Python/2.7/site-packages/tensorflow/python/client/session.py", line 1014, in _do_run
target_list, options, run_metadata)
File "/Library/Python/2.7/site-packages/tensorflow/python/client/session.py", line 1021, in _do_call
return fn(*args)
File "/Library/Python/2.7/site-packages/tensorflow/python/client/session.py", line 1003, in _run_fn
status, run_metadata)
KeyboardInterrupt
Kristoffers-MacBook-Pro:im2txt kristoffer$ logout
Saving session...
...copying shared history...
...saving history...truncating history files...
...completed.
[Process completed]
When I'm trying to start training again, I get the following error:
$ bazel-bin/im2txt/train --input_file_pattern="${MSCOCO_DIR}/train-?????-of-00256" --inception_checkpoint_file="${INCEPTION_CHECKPOINT}" --train_dir="${MODEL_DIR}/train" --train_inception=false --number_of_steps=150
CRITICAL:tensorflow:Found no input files matching /train-?????-of-00256
Traceback (most recent call last):
File "/Users/kristoffer/web/im2txt/im2txt/bazel-bin/im2txt/train.runfiles/im2txt/im2txt/train.py", line 114, in <module>
tf.app.run()
File "/Library/Python/2.7/site-packages/tensorflow/python/platform/app.py", line 43, in run
sys.exit(main(sys.argv[:1] + flags_passthrough))
File "/Users/kristoffer/web/im2txt/im2txt/bazel-bin/im2txt/train.runfiles/im2txt/im2txt/train.py", line 65, in main
model.build()
File "/Users/kristoffer/web/im2txt/im2txt/bazel-bin/im2txt/train.runfiles/im2txt/im2txt/show_and_tell_model.py", line 353, in build
self.build_inputs()
File "/Users/kristoffer/web/im2txt/im2txt/bazel-bin/im2txt/train.runfiles/im2txt/im2txt/show_and_tell_model.py", line 153, in build_inputs
num_reader_threads=self.config.num_input_reader_threads)
File "/Users/kristoffer/web/im2txt/im2txt/bazel-bin/im2txt/train.runfiles/im2txt/im2txt/ops/inputs.py", line 98, in prefetch_input_data
data_files, shuffle=True, capacity=16, name=shard_queue_name)
File "/Library/Python/2.7/site-packages/tensorflow/python/training/input.py", line 211, in string_input_producer
raise ValueError(not_null_err)
ValueError: string_input_producer requires a non-null input tensor
What causes this and what can I do about it? Is there a proper way to pause/cancel a training session? (Tensorflow seems to pick up where it left of if you startup with training 50 steps and then set the steps to 100)
It seems like your problem is caused by that Tensorflow is trying to load a session, which wasn't properly saved when you interrupted your code. Your solutions now would either be not loading the last session when you restart the code (by commenting the loading line), or deleting the saved session files (then it should automatically restart from scratch). It is hard to give a more specific example, since you didn't share your code...

Adding to vocab during Tensorflow seq2seq run

I am using the Tensorflow seq2seq tutorial to play with machine translation. Say I have trained the model for some time and determine that I want to supplement the original vocab with new words to enhance the quality of the model. Is there a way to pause training, add words to the vocabulary, and then resume training from the most recent checkpoint? I attempted to do so but when I began training again I got this error:
Traceback (most recent call last):
File "execute.py", line 405, in <module>
train()
File "execute.py", line 127, in train
model = create_model(sess, False)
File "execute.py", line 108, in create_model
model.saver.restore(session, ckpt.model_checkpoint_path)
File "/home/jrthom18/.local/lib/python2.7/site- packages/tensorflow/python/training/saver.py", line 1388, in restore
{self.saver_def.filename_tensor_name: save_path})
File "/home/jrthom18/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 766, in run
run_metadata_ptr)
File "/home/jrthom18/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 964, in _run
feed_dict_string, options, run_metadata)
File "/home/jrthom18/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1014, in _do_run
target_list, options, run_metadata)
File "/home/jrthom18/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1034, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [384633] rhs shape= [384617]
[[Node: save/Assign_82 = Assign[T=DT_FLOAT, _class=["loc:#proj_b"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/cpu:0"](proj_b, save/RestoreV2_82)]]
Caused by op u'save/Assign_82', defined at:
File "execute.py", line 405, in <module>
train()
File "execute.py", line 127, in train
model = create_model(sess, False)
File "execute.py", line 99, in create_model
model = seq2seq_model.Seq2SeqModel( gConfig['enc_vocab_size'], gConfig['dec_vocab_size'], _buckets, gConfig['layer_size'], gConfig['num_layers'], gConfig['max_gradient_norm'], gConfig['batch_size'], gConfig['learning_rate'], gConfig['learning_rate_decay_factor'], forward_only=forward_only)
File "/home/jrthom18/data/3x256_bs32/easy_seq2seq/seq2seq_model.py", line 166, in __init__
self.saver = tf.train.Saver(tf.global_variables(), keep_checkpoint_every_n_hours=2.0)
File "/home/jrthom18/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1000, in __init__
self.build()
File "/home/jrthom18/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1030, in build
restore_sequentially=self._restore_sequentially)
File "/home/jrthom18/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 624, in build
restore_sequentially, reshape)
File "/home/jrthom18/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 373, in _AddRestoreOps
assign_ops.append(saveable.restore(tensors, shapes))
File "/home/jrthom18/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 130, in restore
self.op.get_shape().is_fully_defined())
File "/home/jrthom18/.local/lib/python2.7/site-packages/tensorflow/python/ops/gen_state_ops.py", line 47, in assign
use_locking=use_locking, name=name)
File "/home/jrthom18/.local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 759, in apply_op
op_def=op_def)
File "/home/jrthom18/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2240, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/home/jrthom18/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1128, in __init__
self._traceback = _extract_stack()
InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [384633] rhs shape= [384617]
[[Node: save/Assign_82 = Assign[T=DT_FLOAT, _class=["loc:#proj_b"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/cpu:0"](proj_b, save/RestoreV2_82)]]
Obviously the new vocab is larger and so the tensor sizes do not match. Is there some way around this?
You can not update your vocab once its set but you can always use shared wordpiece model. It will help you to directly copy out of vocab words form source to target output.