TensorFlow OOM after freeze graph

I'm running a seq2seq model with TensorFlow. The inference program runs fine when it loads parameters from a checkpoint file using tf.train.Saver. But after exporting the graph with freeze_graph.py (using tf.framework.graph_util.convert_variables_to_constants()) and importing it with tf.import_graph_def in the inference program, it runs out of memory (OOM).
Here is part of the error log:
W tensorflow/core/common_runtime/bfc_allocator.cc:274] ****************************************************************************************************
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 4.0KiB. See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:983] Internal: Dst tensor is not initialized.
E tensorflow/core/common_runtime/executor.cc:594] Executor failed to create kernel. Internal: Dst tensor is not initialized.
[[Node: embedding_attention_seq2seq/embedding_attention_decoder/attention_decoder/AttnV_0 = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [1024] values: -0.016628871 -0.2054652 -0.045054652...>, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
Traceback (most recent call last):
File "inference.py", line 88, in console_main
result = list(inference(source_sentence))
File "inference.py", line 54, in inference
for sequence in result:
File "/data/experiment/decoder.py", line 115, in search_best_sequence
State.batch_predict(self.session, self.model, self.context, beam)
File "/data/experiment/decoder.py", line 82, in batch_predict
state_list[0].depth)
File "/data/experiment/seq2seq_model.py", line 452, in batch_feed_decoder
log_softmax, attns, state = session.run(output_fetch, input_feed)
File "/home/.conda/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 767, in run
run_metadata_ptr)
File "/home/.conda/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 966, in _run
feed_dict_string, options, run_metadata)
File "/home/.conda/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1016, in _do_run
target_list, options, run_metadata)
File "/home/.conda/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1036, in _do_call
raise type(e)(node_def, op, message)
InternalError: Dst tensor is not initialized.
[[Node: embedding_attention_seq2seq/embedding_attention_decoder/attention_decoder/AttnV_0 = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [1024] values: -0.016628871 -0.2054652 -0.045054652...>, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
Caused by op u'embedding_attention_seq2seq/embedding_attention_decoder/attention_decoder/AttnV_0', defined at:
File "inference.py", line 169, in <module>
tf.app.run()
File "/home/.conda/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 44, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "inference.py", line 165, in main
console_main(session)
File "inference.py", line 66, in console_main
model = create_model(session, False)
File "/data/experiment/model.py", line 145, in create_model
tensor_name_pickle=tensor_name_pickle)
File "/data/experiment/seq2seq_model.py", line 106, in __init__
tf.import_graph_def(graph_def, name="")
File "/home/.conda/lib/python2.7/site-packages/tensorflow/python/framework/importer.py", line 287, in import_graph_def
op_def=op_def)
File "/home/.conda/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2395, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/home/.conda/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1264, in __init__
self._traceback = _extract_stack()
InternalError (see above for traceback): Dst tensor is not initialized.
[[Node: embedding_attention_seq2seq/embedding_attention_decoder/attention_decoder/AttnV_0 = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [1024] values: -0.016628871 -0.2054652 -0.045054652...>, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
I think it might be caused by the memory usage of tf.Constant. Has anyone run into this problem?

I had the same issue, but when loading the model and running inference from a C++ application using the C API. After a lot of testing, the culprit appeared to be the frozen graph and freeze_graph.py itself. It's probably a bug of some kind. There are multiple issue reports in the TensorFlow GitHub repo, but they were closed due to lack of activity, e.g. here and here. I guess bugs in model freezing aren't a priority.
In my case the model .pb file was around 500 MB, and it took around 10 GB of RAM while running a session. Not only did it occupy an insane amount of RAM, it was also orders of magnitude slower that way.
When I switched to loading a SavedModel directory instead, everything went back to normal. I'm not sure how to achieve that in Python, but in C code I replaced the TF_GraphImportGraphDef() call with TF_LoadSessionFromSavedModel().
I used TF v1.14.0. I built the library myself with Bazel; it is not the stock version. I can provide more details if anybody is interested. I'm just not sure where to start, since it took a lot of trial and error.
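For the Python side, which the answer leaves open, here is a minimal sketch assuming the TensorFlow 1.x API. tf.saved_model.loader.load restores a SavedModel directory into a session, so the weights are restored into variables rather than embedded in the GraphDef as Const nodes. The helper name load_saved_model and its arguments are illustrative, and "serve" is the tag that models exported for serving normally carry; a model saved with other tags would need those instead. The thread does not confirm that this avoids the memory problem from Python, so treat it as something to try.

```python
def load_saved_model(export_dir, tags=("serve",)):
    """Load a SavedModel directory into a fresh TF 1.x session (illustrative helper)."""
    import tensorflow as tf  # assumes a TensorFlow 1.x installation

    graph = tf.Graph()
    session = tf.Session(graph=graph)
    # Restores the graph structure and the variable values into `session`.
    tf.saved_model.loader.load(session, list(tags), export_dir)
    return session
```

Tensors can then be looked up by name with session.graph.get_tensor_by_name(...) and passed to session.run as before.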

Related

TensorFlow `AssertionError` on `fit()` method

I get an AssertionError when passing my tf.Dataset into the tf.Keras Model's fit() method.
I am using tensorflow==2.0.0.
I checked that my dataset works by running:
# for x,y in dataset:
# print(x.shape, y.shape)
which yields the correct shapes for the model's input data.
The full trace is:
Traceback (most recent call last):
File "/anaconda3/envs/ml36/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/anaconda3/envs/ml36/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/me/train.py", line 102, in <module>
start_training(**arguments)
File "/me/train.py", line 66, in start_training
steps_per_epoch=TRAIN_STEPS_PER_EPOCH,
File "/anaconda3/envs/ml36/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training.py", line 728, in fit
use_multiprocessing=use_multiprocessing)
File "/anaconda3/envs/ml36/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_distributed.py", line 789, in fit
*args, **kwargs)
File "/anaconda3/envs/ml36/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_distributed.py", line 776, in wrapper
mode=dc.CoordinatorMode.INDEPENDENT_WORKER)
File "/anaconda3/envs/ml36/lib/python3.6/site-packages/tensorflow_core/python/distribute/distribute_coordinator.py", line 782, in run_distribute_coordinator
rpc_layer)
File "/anaconda3/envs/ml36/lib/python3.6/site-packages/tensorflow_core/python/distribute/distribute_coordinator.py", line 344, in _run_single_worker
assert strategy
AssertionError
I had the same error when running gcloud ai-platform local train with the final release of TensorFlow 2.0.0. It worked on earlier releases, so try downgrading to 2.0.0b1:
pip install tensorflow==2.0.0b1
--
I also found that you don't get this error if you run the script directly with python or if you run it in the cloud.
If you are training locally without using any distribution strategy, you can add the following lines to your code to solve this issue:
import os

# Remove TF_CONFIG so fit() does not go through the distribute coordinator.
TF_CONFIG = os.environ.get('TF_CONFIG')
if TF_CONFIG:
    os.environ.pop('TF_CONFIG')
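A slightly more cautious variant of this workaround, as a sketch: it removes TF_CONFIG only when the value does not describe a cluster, so a genuinely distributed job keeps its configuration. The helper name and the rule of keying on the "cluster" entry are assumptions made for illustration; the answer itself only shows the unconditional removal.

```python
import json
import os


def drop_tf_config_if_single_worker(environ=None):
    """Remove TF_CONFIG from `environ` unless it names a cluster.

    Returns True if the variable was removed, False otherwise.
    """
    if environ is None:
        environ = os.environ
    raw = environ.get("TF_CONFIG")
    if not raw:
        return False
    try:
        config = json.loads(raw)
    except ValueError:
        # Leave malformed values alone so the caller can inspect them.
        return False
    cluster = config.get("cluster") if isinstance(config, dict) else None
    if cluster:
        # A non-empty cluster means a real distributed setup; keep it.
        return False
    del environ["TF_CONFIG"]
    return True
```

It would need to be called before the model's fit() method runs, for example at the top of the training script.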

Invalid Argument Error Tensorflow Object Detection Training

I am training a TensorFlow object detection model following the TensorFlow Object Detection API. I have trained many models in the past using exactly the same steps. This model, however, keeps giving me the error message below. The error message references
InvalidArgumentError: image_size must contain 3 elements[4]
I searched for the error and found
InvalidArgumentError: image_size must contain 3 elements[4] #3349
which shows the error and gives the solution of checking that all images are RGB. I used the code provided in that thread to check all images and found about 15 that were not RGB. I removed those images and the corresponding XML files, regenerated the CSV and TFRecord files, and restarted the training. I received the error message again. I then tried starting the training over without resuming from the last checkpoint and still received the error. The error does not occur at a regular interval; sometimes the model runs for several thousand steps before failing. I have also tried removing the random crop parameter from the pipeline.config file, which had no effect.
Any help is appreciated.
Error Message:
INFO:tensorflow:global_step/sec: 2.03361
INFO:tensorflow:global step 4039: loss = 6.2836 (0.512 sec/step)
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, image_size must contain 3 elements[4]
[[Node: cond_2/RandomCropImage/sample_distorted_bounding_box/SampleDistortedBoundingBoxV2 = SampleDistortedBoundingBoxV2[T=DT_INT32, area_range=[0.1, 1], aspect_ratio_range=[0.5, 2],max_attempts=100, seed=0, seed2=0, use_image_if_no_bounding_boxes=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](cond_2/RandomCropImage/Shape, cond_2/RandomCropImage/ExpandDims, cond_2/RandomCropImage/PruneNonOverlappingBoxes/Const)]]
INFO:tensorflow:Recording summary at step 4039.
INFO:tensorflow:global step 4040: loss = 4.6984 (0.880 sec/step)
INFO:tensorflow:Finished training! Saving model to disk.
Traceback (most recent call last):
File "/floyd/object_detection/legacy/train.py", line 184, in <module>
tf.app.run()
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 250, in new_func
return func(*args, **kwargs)
File "/floyd/object_detection/legacy/train.py", line 180, in main
graph_hook_fn=graph_rewriter_fn)
File "/floyd/object_detection/legacy/trainer.py", line 415, in train
saver=saver)
File "/usr/local/lib/python3.6/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 785, in train
ignore_live_threads=ignore_live_threads)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/training/supervisor.py", line 833, in stop
ignore_live_threads=ignore_live_threads)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/usr/local/lib/python3.6/site-packages/six.py", line 693, in reraise
raise value
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/training/queue_runner_impl.py", line 252, in _run
enqueue_callable()
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1244, in _single_operation_run
self._call_tf_sessionrun(None, {}, [], target_list, None)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: image_size must contain 3 elements[4]
[[Node: cond_2/RandomCropImage/sample_distorted_bounding_box/SampleDistortedBoundingBoxV2 = SampleDistortedBoundingBoxV2[T=DT_INT32, area_range=[0.1, 1], aspect_ratio_range=[0.5, 2],max_attempts=100, seed=0, seed2=0, use_image_if_no_bounding_boxes=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](cond_2/RandomCropImage/Shape, cond_2/RandomCropImage/ExpandDims, cond_2/RandomCropImage/PruneNonOverlappingBoxes/Const)]]
Thanks in advance.
So it was the RGB image problem after all. I had checked the images, removed the non-RGB ones, and recreated the records, but the model was still pointing to the old records. The paths were very similar, so I did not notice.
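A stale record path like this can be caught by listing what the config actually references. The sketch below pulls every input_path entry out of a pipeline config and prints when each file was last modified. The function names and the regular expression are illustrative rather than part of the Object Detection API, and sharded or wildcard paths would simply show up as missing.

```python
import os
import re
import time

# Matches entries such as: input_path: "data/train.record"
INPUT_PATH_RE = re.compile(r'input_path\s*:\s*"([^"]+)"')


def record_paths(config_text):
    """Return every input_path value found in the text of a pipeline config."""
    return INPUT_PATH_RE.findall(config_text)


def describe(path):
    """Return the last-modified time of `path`, or 'missing' if it does not exist."""
    if not os.path.exists(path):
        return "missing"
    return time.ctime(os.path.getmtime(path))


def report(config_file):
    """Print each record path referenced by `config_file` with its modification time."""
    with open(config_file) as handle:
        for path in record_paths(handle.read()):
            print(path, "->", describe(path))
```

Calling report("pipeline.config") right after regenerating the records should show recent timestamps; an old timestamp suggests the config still points at the previous files.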

cuDNN handle not created: the fix is clear, but how do I implement it?

Hello, I am using Ubuntu 16.04, ROS Kinetic, and TensorFlow 1.13.1.
My aim is to combine an Ensenso N35 camera and its ROS driver with the Mask R-CNN node created for ROS. I have altered the original code of the Mask R-CNN node so that it takes a grayscale input and stacks it onto itself. I have already verified that this works using a virtual version of the Ensenso camera; the SDK contains an app that sets this up. It outputs a white image, but that should not be an issue for testing functionality. The problem arises when I attach the actual camera to the system. This gives the following error:
2019-03-28 13:30:43.113919: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2019-03-28 13:30:43.872243: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-03-28 13:30:43.874466: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
None
None
Traceback (most recent call last):
File "/home/riwo-rack-pc/ROS_Mask_rcnn/src/mask_rcnn_ros/nodes/mask_rcnn_node", line 182, in <module>
main()
File "/home/riwo-rack-pc/ROS_Mask_rcnn/src/mask_rcnn_ros/nodes/mask_rcnn_node", line 179, in main
node.run()
File "/home/riwo-rack-pc/ROS_Mask_rcnn/src/mask_rcnn_ros/nodes/mask_rcnn_node", line 104, in run
results = self._model.detect([np_image], verbose=0)
File "/home/riwo-rack-pc/ROS_Mask_rcnn/src/mask_rcnn_ros/src/mask_rcnn_ros/model.py", line 2340, in detect
self.keras_model.predict([molded_images, image_metas], verbose=0)
File "/home/riwo-rack-pc/.local/lib/python2.7/site-packages/keras/engine/training.py", line 1790, in predict
verbose=verbose, steps=steps)
File "/home/riwo-rack-pc/.local/lib/python2.7/site-packages/keras/engine/training.py", line 1299, in _predict_loop
batch_outs = f(ins_batch)
File "/home/riwo-rack-pc/.local/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 2357, in __call__
**self.session_kwargs)
File "/home/riwo-rack-pc/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/home/riwo-rack-pc/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1156, in _run
feed_dict_tensor, options, run_metadata)
File "/home/riwo-rack-pc/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1334, in _do_run
run_metadata)
File "/home/riwo-rack-pc/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1354, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node conv1/convolution (defined at /home/riwo-rack-pc/.local/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py:3195) ]]
[[node ROI/strided_slice_20 (defined at /home/riwo-rack-pc/ROS_Mask_rcnn/src/mask_rcnn_ros/src/mask_rcnn_ros/utils.py:687) ]]
Caused by op u'conv1/convolution', defined at:
File "/home/riwo-rack-pc/ROS_Mask_rcnn/src/mask_rcnn_ros/nodes/mask_rcnn_node", line 182, in <module>
main()
File "/home/riwo-rack-pc/ROS_Mask_rcnn/src/mask_rcnn_ros/nodes/mask_rcnn_node", line 178, in main
node = MaskRCNNNode()
File "/home/riwo-rack-pc/ROS_Mask_rcnn/src/mask_rcnn_ros/nodes/mask_rcnn_node", line 65, in __init__
config=config)
File "/home/riwo-rack-pc/ROS_Mask_rcnn/src/mask_rcnn_ros/src/mask_rcnn_ros/model.py", line 1735, in __init__
self.keras_model = self.build(mode=mode, config=config)
File "/home/riwo-rack-pc/ROS_Mask_rcnn/src/mask_rcnn_ros/src/mask_rcnn_ros/model.py", line 1791, in build
_, C2, C3, C4, C5 = resnet_graph(input_image, "resnet101", stage5=True)
File "/home/riwo-rack-pc/ROS_Mask_rcnn/src/mask_rcnn_ros/src/mask_rcnn_ros/model.py", line 152, in resnet_graph
x = KL.Conv2D(64, (7, 7), strides=(2, 2), name='conv1', use_bias=True)(x)
File "/home/riwo-rack-pc/.local/lib/python2.7/site-packages/keras/engine/topology.py", line 603, in __call__
output = self.call(inputs, **kwargs)
File "/home/riwo-rack-pc/.local/lib/python2.7/site-packages/keras/layers/convolutional.py", line 164, in call
dilation_rate=self.dilation_rate)
File "/home/riwo-rack-pc/.local/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 3195, in conv2d
data_format=tf_data_format)
File "/home/riwo-rack-pc/.local/lib/python2.7/site-packages/tensorflow/python/ops/nn_ops.py", line 851, in convolution
return op(input, filter)
File "/home/riwo-rack-pc/.local/lib/python2.7/site-packages/tensorflow/python/ops/nn_ops.py", line 966, in __call__
return self.conv_op(inp, filter)
File "/home/riwo-rack-pc/.local/lib/python2.7/site-packages/tensorflow/python/ops/nn_ops.py", line 591, in __call__
return self.call(inp, filter)
File "/home/riwo-rack-pc/.local/lib/python2.7/site-packages/tensorflow/python/ops/nn_ops.py", line 208, in __call__
name=self.name)
File "/home/riwo-rack-pc/.local/lib/python2.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 1026, in conv2d
data_format=data_format, dilations=dilations, name=name)
File "/home/riwo-rack-pc/.local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/home/riwo-rack-pc/.local/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/home/riwo-rack-pc/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
op_def=op_def)
File "/home/riwo-rack-pc/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
self._traceback = tf_stack.extract_stack()
UnknownError (see above for traceback): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node conv1/convolution (defined at /home/riwo-rack-pc/.local/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py:3195) ]]
[[node ROI/strided_slice_20 (defined at /home/riwo-rack-pc/ROS_Mask_rcnn/src/mask_rcnn_ros/src/mask_rcnn_ros/utils.py:687) ]]
I can't, for the life of me, figure out where this goes wrong or why. I was assured that the virtual camera outputs the same data as the actual one would, but the error only occurs when using the actual camera.
What I have found so far is that the following statement should be added somewhere in the code, but I cannot think of, or find, the proper place for it:
config_pb2.GPUOptions(allow_growth=True)
Help would be much appreciated! Also, if anyone thinks this question is better asked elsewhere, I will move it there.
I see that you are using Python 2.7, but the Mask R-CNN documentation requires Python 3.4 or later:
python_requires='>=3.4',
Other things you should consider:
If you're trying to use your GPU, you should use tensorflow-gpu.
$ pip install tensorflow-gpu
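On where the allow_growth option from the question belongs: a minimal sketch, assuming TensorFlow 1.x with standalone Keras, as the traceback suggests. The option is set on a session config, and that session is handed to Keras before any model is built. The helper name enable_gpu_memory_growth is hypothetical. The thread does not confirm that this resolves CUDNN_STATUS_INTERNAL_ERROR in this setup, so it is something to try rather than a verified fix.

```python
def enable_gpu_memory_growth():
    """Create a TF 1.x session that allocates GPU memory on demand and give it to Keras."""
    import tensorflow as tf          # TensorFlow 1.x API assumed
    from keras import backend as K   # standalone Keras, as seen in the traceback

    config = tf.ConfigProto()
    # Allocate GPU memory as needed instead of reserving nearly all of it up front.
    config.gpu_options.allow_growth = True
    session = tf.Session(config=config)
    # Keras uses this session for every model created afterwards.
    K.set_session(session)
    return session
```

According to the traceback, the model is built inside MaskRCNNNode(), which main() constructs, so the call would go at the start of main(), before that constructor runs.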

Does Google Cloud ML support GPU?

I'm testing Google Cloud ML to speed up my ML model using TensorFlow.
Unfortunately, Google Cloud ML seems to be extremely slow. My mainstream-level PC is at least 10x faster than Google Cloud ML.
I doubted that it uses a GPU, so I ran a test. I modified the sample code to force it to use a GPU.
diff --git a/mnist/trainable/trainer/task.py b/mnist/trainable/trainer/task.py
index 9acb349..a64a11d 100644
--- a/mnist/trainable/trainer/task.py
+++ b/mnist/trainable/trainer/task.py
@@ -131,11 +131,12 @@ def run_training():
images_placeholder, labels_placeholder = placeholder_inputs(
FLAGS.batch_size)
- # Build a Graph that computes predictions from the inference model.
- logits = mnist.inference(images_placeholder, FLAGS.hidden1, FLAGS.hidden2)
+ with tf.device("/gpu:0"):
+ # Build a Graph that computes predictions from the inference model.
+ logits = mnist.inference(images_placeholder, FLAGS.hidden1, FLAGS.hidden2)
- # Add to the Graph the Ops for loss calculation.
- loss = mnist.loss(logits, labels_placeholder)
+ # Add to the Graph the Ops for loss calculation.
+ loss = mnist.loss(logits, labels_placeholder)
# Add to the Graph the Ops that calculate and apply gradients.
train_op = mnist.training(loss, FLAGS.learning_rate)
This training code works on my PC (gcloud beta ml local train ...) but not in the cloud. It gives errors like this:
Traceback (most recent call last):
File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 239, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 43, in run
sys.exit(main(sys.argv[:1] + flags_passthrough))
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 235, in main
run_training()
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 177, in run_training
sess.run(init)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 766, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 964, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1014, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1034, in _do_call
raise type(e)(node_def, op, message)
InvalidArgumentError: Cannot assign a device to node 'softmax_linear/biases': Could not satisfy explicit device specification '/device:GPU:0' because no devices matching that specification are registered in this process; available devices: /job:localhost/replica:0/task:0/cpu:0
Colocation Debug Info:
Colocation group had the following types and devices:
ApplyGradientDescent: CPU
Identity: CPU
Assign: CPU
Variable: CPU
[[Node: softmax_linear/biases = Variable[container="", dtype=DT_FLOAT, shape=[10], shared_name="", _device="/device:GPU:0"]()]]
Does Google Cloud ML support GPU?
GPUs are now in Beta and all Cloud ML customers have access.
Here are the docs for using GPUs with Cloud ML.

TensorFlow CIFAR10 cifar10_eval.py throws error: Compute status: Invalid argument: Assign requires shapes of both tensors to match

I am running the SVHN data set through the CIFAR10 example provided in the TensorFlow packages. All I did was change the source directories for the data and modify a few lines of code here and there. I can successfully train the network.
However, when I run svhn_eval.py (the equivalent of cifar10_eval.py, renamed so I can keep my files organized), I get the error "Assign requires shapes of both tensors to match". I guess that the problem could be due to
saver.restore(sess, ckpt.model_checkpoint_path)
as the trace ends there and goes deep into the other TensorFlow files. Does anyone know how to solve this?
W tensorflow/core/common_runtime/executor.cc:1076] 0x1a5bad0 Compute status: Invalid argument: Assign requires shapes of both tensors to match. lhs shape= [2304,384] rhs shape= [4096,384]
[[Node: save/Assign_5 = Assign[T=DT_FLOAT, use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/cpu:0"](local3/weights, save/restore_slice_5)]]
Traceback (most recent call last):
File "/home/samuelchin/svhn/svhn_eval.py", line 161, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/default/_app.py", line 30, in run
sys.exit(main(sys.argv))
File "/home/samuelchin/svhn/svhn_eval.py", line 157, in main
evaluate()
File "/home/samuelchin/svhn/svhn_eval.py", line 147, in evaluate
eval_once(saver, summary_writer, top_k_op, summary_op)
File "/home/samuelchin/svhn/svhn_eval.py", line 78, in eval_once
saver.restore(sess, ckpt.model_checkpoint_path)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 891, in restore
sess.run([self._restore_op_name], {self._filename_tensor_name: save_path})
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 373, in run
results = self._do_run(target_list, unique_fetch_targets, feed_dict_string)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 449, in _do_run
e.code)
tensorflow.python.framework.errors.InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [2304,384] rhs shape= [4096,384]
[[Node: save/Assign_5 = Assign[T=DT_FLOAT, use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/cpu:0"](local3/weights, save/restore_slice_5)]]
Caused by op u'save/Assign_5', defined at:
File "/home/samuelchin/svhn/svhn_eval.py", line 161, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/default/_app.py", line 30, in run
sys.exit(main(sys.argv))
File "/home/samuelchin/svhn/svhn_eval.py", line 157, in main
evaluate()
File "/home/samuelchin/svhn/svhn_eval.py", line 137, in evaluate
saver = tf.train.Saver(variables_to_restore)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 713, in __init__
restore_sequentially=restore_sequentially)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 432, in build
filename_tensor, vars_to_save, restore_sequentially, reshape)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 202, in _AddRestoreOps
validate_shape=not reshape))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_state_ops.py", line 40, in assign
use_locking=use_locking, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/op_def_library.py", line 660, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1850, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1049, in __init__
self._traceback = _extract_stack()
EDIT 1: The lines of code that I changed are in distorted_inputs. In the original CIFAR10, there was a random crop from a 32x32 to a 24x24 picture. However, in the SVHN implementation, I input 32x32 images. Based on the error output, we can more or less figure out what's wrong.
lhs shape= [2304,384] rhs shape= [4096,384]
2304 = 24 * 24 * 4
4096 = 32 * 32 * 4
The question we have to ask ourselves now is: why multiply by 4?
The solution is that cifar10.py has a variable called IMAGE_SIZE. I left it at 24 because I thought it would not affect anything. However, when you run the test set, the inputs are cropped to IMAGE_SIZE x IMAGE_SIZE.
Therefore, since that variable wasn't changed, the tensor dimensions do not match. Changing it to 32 will do the trick.
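The factor of 4 can be worked out under one assumption: that the network is the stock cifar10.py model, with two stride-2 pooling layers using SAME padding and 64 channels coming out of the second convolution. Those numbers are my reading of the tutorial model, not something stated in the post. Each pooling layer halves the side length, so the flattened input to local3 is (size / 4) * (size / 4) * 64, which equals size * size * 4. A small sketch of that arithmetic, with a hypothetical helper name:

```python
import math


def local3_input_dim(image_size, num_pools=2, pool_stride=2, channels=64):
    """Flattened feature size fed into local3, assuming SAME-padded pooling."""
    side = image_size
    for _ in range(num_pools):
        # SAME padding rounds the output side length up.
        side = math.ceil(side / pool_stride)
    return side * side * channels


print(local3_input_dim(24))  # 2304, the shape built by the eval graph
print(local3_input_dim(32))  # 4096, the shape stored in the checkpoint
```

This matches the two shapes in the error: the evaluation graph was built for 24x24 crops, while the checkpoint was trained on 32x32 inputs.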