I am running the ResNet model on an EC2 g2 instance (NVIDIA GRID K520) and seeing an OOM error. I have tried various combinations of removing the code that uses a GPU, prefixing the command with CUDA_VISIBLE_DEVICES='0', and reducing the batch size to 64, but I am still unable to start training. Can you help?
W tensorflow/core/common_runtime/bfc_allocator.cc:270] **********************x***************************************************************************xx
W tensorflow/core/common_runtime/bfc_allocator.cc:271] Ran out of memory trying to allocate 196.00MiB. See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:936] Resource exhausted: OOM when allocating tensor with shape[64,16,224,224]
E tensorflow/core/client/tensor_c_api.cc:485] OOM when allocating tensor with shape[64,16,224,224]
[[Node: unit_1_2/sub1/conv1/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](unit_1_2/residual_only_activation/leaky_relu, unit_1_2/sub1/conv1/DW/read)]]
[[Node: train_step/update/_1561 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_10115_train_step/update", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Traceback (most recent call last):
File "./resnet_main.py", line 203, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv))
File "./resnet_main.py", line 197, in main
train(hps)
File "./resnet_main.py", line 82, in train
feed_dict={model.lrn_rate: lrn_rate})
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 382, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 655, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 723, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 743, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.ResourceExhaustedError: OOM when allocating tensor with shape[64,16,224,224]
[[Node: unit_1_2/sub1/conv1/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](unit_1_2/residual_only_activation/leaky_relu, unit_1_2/sub1/conv1/DW/read)]]
[[Node: train_step/update/_1561 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_10115_train_step/update", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Caused by op u'unit_1_2/sub1/conv1/Conv2D', defined at:
File "./resnet_main.py", line 203, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv))
File "./resnet_main.py", line 197, in main
train(hps)
File "./resnet_main.py", line 64, in train
model.build_graph()
File "/home/ubuntu/indu/tf-benchmark/resnet/resnet_model.py", line 59, in build_graph
self._build_model()
File "/home/ubuntu/indu/tf-benchmark/resnet/resnet_model.py", line 94, in _build_model
x = res_func(x, filters[1], filters[1], self._stride_arr(1), False)
File "/home/ubuntu/indu/tf-benchmark/resnet/resnet_model.py", line 208, in _residual
x = self._conv('conv1', x, 3, in_filter, out_filter, stride)
File "/home/ubuntu/indu/tf-benchmark/resnet/resnet_model.py", line 279, in _conv
return tf.nn.conv2d(x, kernel, strides, padding='SAME')
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 394, in conv2d
data_format=data_format, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 703, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2310, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1232, in __init__
self._traceback = _extract_stack()
The NVIDIA GRID K520 has 8GB of memory. I have successfully trained ResNet models on an NVIDIA GPU with 12GB of memory. As the error suggests, TensorFlow attempts to place all of the network's weights in GPU memory and fails. I believe you have a few options:
Train only on the CPU, as mentioned in the comments, assuming your machine has more than 8GB of RAM (see the sketch after this list). This is not recommended, since CPU training is very slow.
Train a different network with fewer parameters. Several networks released since ResNet, such as Inception-v4 and Inception-ResNet, have fewer parameters and comparable accuracy. This option costs nothing to try!
Buy a GPU with more memory. This is the easiest option if you have the money.
Buy another GPU with the same amount of memory and train the bottom half of the network on one and the top half on the other. The difficulty of communicating between the GPUs makes this option less desirable.
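If you go with option 1, a minimal sketch of forcing a CPU-only run, or of letting TensorFlow allocate GPU memory on demand instead of grabbing it all up front, could look like this (assuming the TF 1.x-era graph API used in the question; adapt it to your own training script):
import os

# Hide all GPUs *before* TensorFlow is imported so the run falls back to the CPU.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

import tensorflow as tf

# Alternative: keep the GPU visible but grow its memory incrementally instead of
# pre-allocating everything; this sometimes helps when the model only just overflows.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

sess = tf.Session(config=config)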
I hope this helps you and others that run into similar memory issues.
Related
I am trying to train SSD Lite + MobileNetV2 using model_main.py in tensorflow/models/research/object_detection, but I'm getting the following error:
Assign requires shapes of both tensors to match. lhs shape= [1,1,256,256] rhs shape= [1,1,1280,256] [[Node: save/Assign_348 = Assign[T=DT_FLOAT, _class=["loc:#FeatureExtractor/MobilenetV2/layer_19_1_Conv2d_2_1x1_256/weights"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](FeatureExtractor/MobilenetV2/layer_19_1_Conv2d_2_1x1_256/weights, save/RestoreV2:348)]]
The full log is given here:
python E:\Documents\Projects\tensorflow\models\research\object_detection\model_main.py --alsologtostderr --pipeline_config_path=experiments/training_/ssdlite_mobilenet_v2_coco.config --model_dir=experiments/training_/ --num_train_steps=50000 --NUM_EVAL_STEPS=2000
WARNING:tensorflow:Forced number of epochs for all eval validations to be 1.
W1123 09:11:18.686478 7432 tf_logging.py:125] Forced number of epochs for all eval validations to be 1.
WARNING:tensorflow:Expected number of evaluation epochs is 1, but instead encountered eval_on_train_input_config.num_epochs = 0. Overwriting num_epochs to 1.
W1123 09:11:18.687448 7432 tf_logging.py:125] Expected number of evaluation epochs is 1, but instead encountered eval_on_train_input_config.num_epochs = 0. Overwriting num_epochs to 1.
WARNING:tensorflow:Estimator's model_fn (<function create_model_fn..model_fn at 0x000001E641D31268>) includes params argument, but params are not passed to Estimator.
W1123 09:11:18.688472 7432 tf_logging.py:125] Estimator's model_fn (<function create_model_fn..model_fn at 0x000001E641D31268>) includes params argument, but params are not passed to Estimator.
2018-11-23 09:11:23.084879: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2018-11-23 09:11:23.365711: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1392] Found device 0 with properties:
name: GeForce GTX 1060 6GB major: 6 minor: 1 memoryClockRate(GHz): 1.835
pciBusID: 0000:01:00.0
totalMemory: 6.00GiB freeMemory: 4.97GiB
2018-11-23 09:11:23.372771: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1471] Adding visible gpu devices: 0
2018-11-23 09:11:24.001841: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:952] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-23 09:11:24.004967: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:958] 0
2018-11-23 09:11:24.007058: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:971] 0: N
2018-11-23 09:11:24.009175: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1084] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4741 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0, compute capability: 6.1)
Traceback (most recent call last):
File "D:\ProgramData\Anaconda3\envs\tfod\lib\site-packages\tensorflow\python\client\session.py", line 1322, in _do_call
return fn(*args)
File "D:\ProgramData\Anaconda3\envs\tfod\lib\site-packages\tensorflow\python\client\session.py", line 1307, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "D:\ProgramData\Anaconda3\envs\tfod\lib\site-packages\tensorflow\python\client\session.py", line 1409, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [1,1,256,256] rhs shape= [1,1,1280,256]
[[Node: save/Assign_348 = Assign[T=DT_FLOAT, _class=["loc:#FeatureExtractor/MobilenetV2/layer_19_1_Conv2d_2_1x1_256/weights"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](FeatureExtractor/MobilenetV2/layer_19_1_Conv2d_2_1x1_256/weights, save/RestoreV2:348)]]
[[Node: save/RestoreV2/_599 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_728_save/RestoreV2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "E:\Documents\Projects\tensorflow\models\research\object_detection\model_main.py", line 109, in
tf.app.run()
File "D:\ProgramData\Anaconda3\envs\tfod\lib\site-packages\tensorflow\python\platform\app.py", line 125, in run
_sys.exit(main(argv))
File "E:\Documents\Projects\tensorflow\models\research\object_detection\model_main.py", line 105, in main
tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
File "D:\ProgramData\Anaconda3\envs\tfod\lib\site-packages\tensorflow\python\estimator\training.py", line 447, in train_and_evaluate
return executor.run()
File "D:\ProgramData\Anaconda3\envs\tfod\lib\site-packages\tensorflow\python\estimator\training.py", line 531, in run
return self.run_local()
File "D:\ProgramData\Anaconda3\envs\tfod\lib\site-packages\tensorflow\python\estimator\training.py", line 681, in run_local
eval_result, export_results = evaluator.evaluate_and_export()
File "D:\ProgramData\Anaconda3\envs\tfod\lib\site-packages\tensorflow\python\estimator\training.py", line 886, in evaluate_and_export
hooks=self._eval_spec.hooks)
File "D:\ProgramData\Anaconda3\envs\tfod\lib\site-packages\tensorflow\python\estimator\estimator.py", line 460, in evaluate
output_dir=self.eval_dir(name))
File "D:\ProgramData\Anaconda3\envs\tfod\lib\site-packages\tensorflow\python\estimator\estimator.py", line 1386, in _evaluate_run
config=self._session_config)
File "D:\ProgramData\Anaconda3\envs\tfod\lib\site-packages\tensorflow\python\training\evaluation.py", line 209, in _evaluate_once
session_creator=session_creator, hooks=hooks) as session:
File "D:\ProgramData\Anaconda3\envs\tfod\lib\site-packages\tensorflow\python\training\monitored_session.py", line 826, in init
stop_grace_period_secs=stop_grace_period_secs)
File "D:\ProgramData\Anaconda3\envs\tfod\lib\site-packages\tensorflow\python\training\monitored_session.py", line 549, in init
self._sess = _RecoverableSession(self._coordinated_creator)
File "D:\ProgramData\Anaconda3\envs\tfod\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1012, in init
_WrappedSession.init(self, self._create_session())
File "D:\ProgramData\Anaconda3\envs\tfod\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1017, in _create_session
return self._sess_creator.create_session()
File "D:\ProgramData\Anaconda3\envs\tfod\lib\site-packages\tensorflow\python\training\monitored_session.py", line 706, in create_session
self.tf_sess = self._session_creator.create_session()
File "D:\ProgramData\Anaconda3\envs\tfod\lib\site-packages\tensorflow\python\training\monitored_session.py", line 477, in create_session
init_fn=self._scaffold.init_fn)
File "D:\ProgramData\Anaconda3\envs\tfod\lib\site-packages\tensorflow\python\training\session_manager.py", line 281, in prepare_session
config=config)
File "D:\ProgramData\Anaconda3\envs\tfod\lib\site-packages\tensorflow\python\training\session_manager.py", line 195, in _restore_checkpoint
saver.restore(sess, checkpoint_filename_with_path)
File "D:\ProgramData\Anaconda3\envs\tfod\lib\site-packages\tensorflow\python\training\saver.py", line 1752, in restore
{self.saver_def.filename_tensor_name: save_path})
File "D:\ProgramData\Anaconda3\envs\tfod\lib\site-packages\tensorflow\python\client\session.py", line 900, in run
run_metadata_ptr)
File "D:\ProgramData\Anaconda3\envs\tfod\lib\site-packages\tensorflow\python\client\session.py", line 1135, in _run
feed_dict_tensor, options, run_metadata)
File "D:\ProgramData\Anaconda3\envs\tfod\lib\site-packages\tensorflow\python\client\session.py", line 1316, in _do_run
run_metadata)
File "D:\ProgramData\Anaconda3\envs\tfod\lib\site-packages\tensorflow\python\client\session.py", line 1335, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [1,1,256,256] rhs shape= [1,1,1280,256]
[[Node: save/Assign_348 = Assign[T=DT_FLOAT, _class=["loc:#FeatureExtractor/MobilenetV2/layer_19_1_Conv2d_2_1x1_256/weights"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](FeatureExtractor/MobilenetV2/layer_19_1_Conv2d_2_1x1_256/weights, save/RestoreV2:348)]]
[[Node: save/RestoreV2/_599 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_728_save/RestoreV2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
Caused by op 'save/Assign_348', defined at:
File "E:\Documents\Projects\tensorflow\models\research\object_detection\model_main.py", line 109, in
tf.app.run()
File "D:\ProgramData\Anaconda3\envs\tfod\lib\site-packages\tensorflow\python\platform\app.py", line 125, in run
_sys.exit(main(argv))
File "E:\Documents\Projects\tensorflow\models\research\object_detection\model_main.py", line 105, in main
tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
File "D:\ProgramData\Anaconda3\envs\tfod\lib\site-packages\tensorflow\python\estimator\training.py", line 447, in train_and_evaluate
return executor.run()
File "D:\ProgramData\Anaconda3\envs\tfod\lib\site-packages\tensorflow\python\estimator\training.py", line 531, in run
return self.run_local()
File "D:\ProgramData\Anaconda3\envs\tfod\lib\site-packages\tensorflow\python\estimator\training.py", line 681, in run_local
eval_result, export_results = evaluator.evaluate_and_export()
File "D:\ProgramData\Anaconda3\envs\tfod\lib\site-packages\tensorflow\python\estimator\training.py", line 886, in evaluate_and_export
hooks=self._eval_spec.hooks)
File "D:\ProgramData\Anaconda3\envs\tfod\lib\site-packages\tensorflow\python\estimator\estimator.py", line 460, in evaluate
output_dir=self.eval_dir(name))
File "D:\ProgramData\Anaconda3\envs\tfod\lib\site-packages\tensorflow\python\estimator\estimator.py", line 1386, in _evaluate_run
config=self._session_config)
File "D:\ProgramData\Anaconda3\envs\tfod\lib\site-packages\tensorflow\python\training\evaluation.py", line 209, in _evaluate_once
session_creator=session_creator, hooks=hooks) as session:
File "D:\ProgramData\Anaconda3\envs\tfod\lib\site-packages\tensorflow\python\training\monitored_session.py", line 826, in init
stop_grace_period_secs=stop_grace_period_secs)
File "D:\ProgramData\Anaconda3\envs\tfod\lib\site-packages\tensorflow\python\training\monitored_session.py", line 549, in init
self._sess = _RecoverableSession(self._coordinated_creator)
File "D:\ProgramData\Anaconda3\envs\tfod\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1012, in init
_WrappedSession.init(self, self._create_session())
File "D:\ProgramData\Anaconda3\envs\tfod\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1017, in _create_session
return self._sess_creator.create_session()
File "D:\ProgramData\Anaconda3\envs\tfod\lib\site-packages\tensorflow\python\training\monitored_session.py", line 706, in create_session
self.tf_sess = self._session_creator.create_session()
File "D:\ProgramData\Anaconda3\envs\tfod\lib\site-packages\tensorflow\python\training\monitored_session.py", line 468, in create_session
self._scaffold.finalize()
File "D:\ProgramData\Anaconda3\envs\tfod\lib\site-packages\tensorflow\python\training\monitored_session.py", line 212, in finalize
self._saver = training_saver._get_saver_or_default() # pylint: disable=protected-access
File "D:\ProgramData\Anaconda3\envs\tfod\lib\site-packages\tensorflow\python\training\saver.py", line 856, in _get_saver_or_default
saver = Saver(sharded=True, allow_empty=True)
File "D:\ProgramData\Anaconda3\envs\tfod\lib\site-packages\tensorflow\python\training\saver.py", line 1284, in init
self.build()
File "D:\ProgramData\Anaconda3\envs\tfod\lib\site-packages\tensorflow\python\training\saver.py", line 1296, in build
self._build(self._filename, build_save=True, build_restore=True)
File "D:\ProgramData\Anaconda3\envs\tfod\lib\site-packages\tensorflow\python\training\saver.py", line 1333, in _build
build_save=build_save, build_restore=build_restore)
File "D:\ProgramData\Anaconda3\envs\tfod\lib\site-packages\tensorflow\python\training\saver.py", line 775, in _build_internal
restore_sequentially, reshape)
File "D:\ProgramData\Anaconda3\envs\tfod\lib\site-packages\tensorflow\python\training\saver.py", line 453, in _AddShardedRestoreOps
name="restore_shard"))
File "D:\ProgramData\Anaconda3\envs\tfod\lib\site-packages\tensorflow\python\training\saver.py", line 422, in _AddRestoreOps
assign_ops.append(saveable.restore(saveable_tensors, shapes))
File "D:\ProgramData\Anaconda3\envs\tfod\lib\site-packages\tensorflow\python\training\saver.py", line 113, in restore
self.op.get_shape().is_fully_defined())
File "D:\ProgramData\Anaconda3\envs\tfod\lib\site-packages\tensorflow\python\ops\state_ops.py", line 219, in assign
validate_shape=validate_shape)
File "D:\ProgramData\Anaconda3\envs\tfod\lib\site-packages\tensorflow\python\ops\gen_state_ops.py", line 63, in assign
use_locking=use_locking, name=name)
File "D:\ProgramData\Anaconda3\envs\tfod\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "D:\ProgramData\Anaconda3\envs\tfod\lib\site-packages\tensorflow\python\framework\ops.py", line 3414, in create_op
op_def=op_def)
File "D:\ProgramData\Anaconda3\envs\tfod\lib\site-packages\tensorflow\python\framework\ops.py", line 1740, in init
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [1,1,256,256] rhs shape= [1,1,1280,256]
[[Node: save/Assign_348 = Assign[T=DT_FLOAT, _class=["loc:#FeatureExtractor/MobilenetV2/layer_19_1_Conv2d_2_1x1_256/weights"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](FeatureExtractor/MobilenetV2/layer_19_1_Conv2d_2_1x1_256/weights, save/RestoreV2:348)]]
[[Node: save/RestoreV2/_599 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_728_save/RestoreV2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
My config can be found here.
Also note that I originally posted this on tensorflow/models issues, but the team suggested that I post it here on SO.
The error disappeared after I upgraded to TF 1.12 and cuDNN 7.3 for my CUDA v9 installation.
This is the command; customise it according to the parameters you used in training.
python deeplab/export_model.py --checkpoint_path=/code/models/research/deeplab/weights_input_level_17/model.ckpt-22000 --export_path=/code/models/research/deeplab/frozen_weights_level_17/frozen_inference_graph.pb --model_variant="xception_65" --atrous_rates=6 --atrous_rates=12 --atrous_rates=18 --output_stride=16 --crop_size=2048 --crop_size=2048 --num_classes=3
I trained my model to segment bridges and followed the PASCAL dataset format. Ideally I would have only one class, but since we have two default classes (1. background and 2. ignore) in addition to 3. bridge, my total number of classes becomes 3.
Hey guys, please use this config to export your DeepLabv3+ model. It worked for me.
Your error is due to the dimensions of your input image; you should make them equal to the ones you used for training.
I am trying to run the TensorFlow Object Detection API. I have made a dataset with 1 class using labelImg and then converted the XMLs to TFRecord files. Some information:
os: Ubuntu 16.04
gpu: nvidia geforce 1080Ti & 1060
tensorflow version: 1.3.0
training model: faster_rcnn_resnet101_coco (although I have tried others)
Classes: 1
I then run train.py and training begins. When I get to around step 355, I get this error:
INFO:tensorflow:global step 363: loss = 1.4006 (0.294 sec/step)
INFO:tensorflow:Finished training! Saving model to disk.
Traceback (most recent call last):
File "object_detection/train.py", line 163, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "object_detection/train.py", line 159, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/home/ucfadng/tensorflow/models/research/object_detection/trainer.py", line 332, in train
saver=saver)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 767, in train
sv.stop(threads, close_summary_writer=True)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 792, in stop
stop_grace_period_secs=self._stop_grace_secs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 296, in stop_on_exception
yield
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 494, in run
self.run_loop()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 994, in run_loop
self._sv.global_step])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 895, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1124, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1321, in _do_run
options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1340, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Nan in summary histogram for: SecondStageFeatureExtractor/resnet_v1_101/block4/unit_3/bottleneck_v1/conv3/BatchNorm/gamma_1
[[Node: SecondStageFeatureExtractor/resnet_v1_101/block4/unit_3/bottleneck_v1/conv3/BatchNorm/gamma_1 = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](SecondStageFeatureExtractor/resnet_v1_101/block4/unit_3/bottleneck_v1/conv3/BatchNorm/gamma_1/tag, SecondStageFeatureExtractor/resnet_v1_101/block4/unit_3/bottleneck_v1/conv3/BatchNorm/gamma/read)]]
[[Node: Loss/RPNLoss/map/TensorArray_2/_1353 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_5239_Loss/RPNLoss/map/TensorArray_2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
Caused by op u'SecondStageFeatureExtractor/resnet_v1_101/block4/unit_3/bottleneck_v1/conv3/BatchNorm/gamma_1', defined at:
File "object_detection/train.py", line 163, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "object_detection/train.py", line 159, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/home/ucfadng/tensorflow/models/research/object_detection/trainer.py", line 295, in train
global_summaries.add(tf.summary.histogram(model_var.op.name, model_var))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/summary/summary.py", line 192, in histogram
tag=tag, values=values, name=scope)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_logging_ops.py", line 129, in _histogram_summary
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1204, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InvalidArgumentError (see above for traceback): Nan in summary histogram for: SecondStageFeatureExtractor/resnet_v1_101/block4/unit_3/bottleneck_v1/conv3/BatchNorm/gamma_1
[[Node: SecondStageFeatureExtractor/resnet_v1_101/block4/unit_3/bottleneck_v1/conv3/BatchNorm/gamma_1 = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](SecondStageFeatureExtractor/resnet_v1_101/block4/unit_3/bottleneck_v1/conv3/BatchNorm/gamma_1/tag, SecondStageFeatureExtractor/resnet_v1_101/block4/unit_3/bottleneck_v1/conv3/BatchNorm/gamma/read)]]
[[Node: Loss/RPNLoss/map/TensorArray_2/_1353 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_5239_Loss/RPNLoss/map/TensorArray_2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
My label map is:
item {
id: 1
name: 'rail'
}
The pipeline used was downloaded from: https://github.com/tensorflow/models/blob/master/research/object_detection/samples/configs/faster_rcnn_resnet101_coco.config
I have had a look at the TFRecord files and they look as expected, which leads me to think this is something to do with the initial dataset. But what specifically would cause this error?
My dataset consists of 250 clear images containing railway tracks. I have labelled sections of the tracks, so each image has roughly 20-30 labelled objects.
So far the main things I have tried are lowering the learning rate and changing the batch size, but these did not solve the problem.
Any help in solving this would be very appreciated.
Cheers
I have now updated to version 1.4.0 and this is no longer a problem. I did not change anything else, so perhaps this was a bug in version 1.3.0. I will leave this up in case anyone has the same issue.
This doesn't seem to be an error with TensorFlow itself; I believe it has something to do with your system running out of memory.
Setting train to false in batch_norm (within your pipeline.config file) appears to overcome this problem.
It should look something like this:
batch_norm {
decay: 0.999
center: true
scale: true
epsilon: 0.001
train: false
}
Hope this helped.
So I am trying to use multiple GPUs with Keras. When I run training_utils.py with the example program (given as comments inside the training_utils.py code), I end up with a ResourceExhaustedError. nvidia-smi tells me that barely one of the four GPUs is working. Using one GPU works fine for other programs.
TensorFlow 1.3.0
Keras 2.0.8
Ubuntu 16.04
CUDA/cuDNN 8.0/6.0
Question: does anyone have any idea what's going on here?
Console output:
(...)
2017-10-26 14:39:02.086838: W tensorflow/core/common_runtime/bfc_allocator.cc:277] ***************************************************************************************************x
2017-10-26 14:39:02.086857: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[128,55,55,256]
Traceback (most recent call last):
File "test.py", line 27, in
parallel_model.fit(x, y, epochs=20, batch_size=256)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/training.py", line 1631, in fit
validation_steps=validation_steps)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/training.py", line 1213, in _fit_loop
outs = f(ins_batch)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 2331, in call
**self.session_kwargs)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 895, in run
run_metadata_ptr)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1124, in _run
feed_dict_tensor, options, run_metadata)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
options, run_metadata)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[128,55,55,256]
[[Node: replica_1/xception/block3_sepconv2/separable_conv2d = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="VALID", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:1"](replica_1/xception/block3_sepconv2/separable_conv2d/depthwise, block3_sepconv2/pointwise_kernel/read/_2103)]]
[[Node: training/RMSprop/gradients/replica_0/xception/block10_sepconv2/separable_conv2d_grad/Conv2DBackpropFilter/_4511 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_25380_training/RMSprop/gradients/replica_0/xception/block10_sepconv2/separable_conv2d_grad/Conv2DBackpropFilter", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]]
Caused by op u'replica_1/xception/block3_sepconv2/separable_conv2d', defined at:
File "test.py", line 19, in <module>
parallel_model = multi_gpu_model(model, gpus=2)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/utils/training_utils.py", line 143, in multi_gpu_model
outputs = model(inputs)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/topology.py", line 603, in __call__
output = self.call(inputs, **kwargs)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/topology.py", line 2061, in call
output_tensors, _, _ = self.run_internal_graph(inputs, masks)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/topology.py", line 2212, in run_internal_graph
output_tensors = _to_list(layer.call(computed_tensor, **kwargs))
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/layers/convolutional.py", line 1221, in call
dilation_rate=self.dilation_rate)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 3279, in separable_conv2d
data_format=tf_data_format)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/nn_impl.py", line 497, in separable_conv2d
name=name)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 397, in conv2d
data_format=data_format, name=name)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1204, in __init__
self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[128,55,55,256]
[[Node: replica_1/xception/block3_sepconv2/separable_conv2d = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="VALID", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:1"](replica_1/xception/block3_sepconv2/separable_conv2d/depthwise, block3_sepconv2/pointwise_kernel/read/_2103)]]
[[Node: training/RMSprop/gradients/replica_0/xception/block10_sepconv2/separable_conv2d_grad/Conv2DBackpropFilter/_4511 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_25380_training/RMSprop/gradients/replica_0/xception/block10_sepconv2/separable_conv2d_grad/Conv2DBackpropFilter", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
EDIT (Added example code):
import tensorflow as tf
from keras.applications import Xception
from keras.utils import multi_gpu_model
import numpy as np

num_samples = 1000
height = 224
width = 224
num_classes = 100

# Instantiate the base model on the CPU so its weights live in host memory.
with tf.device('/cpu:0'):
    model = Xception(weights=None,
                     input_shape=(height, width, 3),
                     classes=num_classes)

# Replicate the model on 4 GPUs; each replica processes a slice of every batch.
parallel_model = multi_gpu_model(model, gpus=4)
parallel_model.compile(loss='categorical_crossentropy',
                       optimizer='rmsprop')

# Dummy training data.
x = np.random.random((num_samples, height, width, 3))
y = np.random.random((num_samples, num_classes))

parallel_model.fit(x, y, epochs=20, batch_size=128)
When encountering an OOM/ResourceExhaustedError on the GPU, I believe reducing the batch size is the right option to try first.
Different GPUs may need different batch sizes, depending on how much GPU memory you have.
I recently faced a similar problem and tweaked a lot of things while running different kinds of experiments.
Here is the link to the question (some tricks are included as well).
However, while reducing the batch size you may find that your training gets slower.
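For concreteness, a minimal sketch of that first step, assuming the Xception example from the question: the only change is a smaller global batch size, which multi_gpu_model splits across the replicas (64 samples per step is 16 per GPU with gpus=4).
import numpy as np
import tensorflow as tf
from keras.applications import Xception
from keras.utils import multi_gpu_model

num_samples, height, width, num_classes = 1000, 224, 224, 100

# Base model on the CPU, replicated across the 4 GPUs as in the question.
with tf.device('/cpu:0'):
    model = Xception(weights=None, input_shape=(height, width, 3),
                     classes=num_classes)
parallel_model = multi_gpu_model(model, gpus=4)
parallel_model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

x = np.random.random((num_samples, height, width, 3))
y = np.random.random((num_samples, num_classes))

# Halve the global batch size (128 -> 64); keep reducing it until the OOM goes away.
parallel_model.fit(x, y, epochs=20, batch_size=64)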
I have 1 PC with 2 GPUs and want to train 2 independent CNNs, one on each GPU. I use the following to create the graph for a GPU:
with tf.device('/gpu:%d' % self.single_gpu):
    self._create_placeholders()
    self._build_conv_net()
    self._create_cost()
    self._creat_optimizer()
The training loop is not under tf.device().
After starting the first CNN training process (say on GPU 1), I start the second CNN training on GPU 0. I always get a CUDA_ERROR_OUT_OF_MEMORY error and cannot start the second training process.
Is it possible to run 2 independent training tasks assigned to 2 GPUs on the same PC? If so, what am I missing?
E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 164.06M (172032000 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
W tensorflow/core/common_runtime/bfc_allocator.cc:274] *******____******************_______________________________________________________________________
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 384.00MiB. See logs for memory state.
Traceback (most recent call last):
File "/home/hl/anaconda3/envs/dl-conda-py36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1022, in _do_call
return fn(*args)
File "/home/hl/anaconda3/envs/dl-conda-py36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1004, in _run_fn
status, run_metadata)
File "/home/hl/anaconda3/envs/dl-conda-py36/lib/python3.6/contextlib.py", line 89, in exit
next(self.gen)
File "/home/hl/anaconda3/envs/dl-conda-py36/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InternalError: Dst tensor is not initialized.
[[Node: _recv_inputs/input_placeholder_0/_7 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:2", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_3__recv_inputs/input_placeholder_0", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:2"]]
[[Node: Mean/_15 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:2", send_device_incarnation=1, tensor_name="edge_414_Mean", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "mg_model_nvidia_gpu.py", line 491, in <module>
main()
File "mg_model_nvidia_gpu.py", line 482, in main
nvidia_cnn.train(data_generator, train_data, val_data)
File "mg_model_nvidia_gpu.py", line 307, in train
self.keep_prob: self.train_config.keep_prob})
File "/home/hl/anaconda3/envs/dl-conda-py36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 767, in run
run_metadata_ptr)
File "/home/hl/anaconda3/envs/dl-conda-py36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 965, in _run
feed_dict_string, options, run_metadata)
File "/home/hl/anaconda3/envs/dl-conda-py36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1015, in _do_run
target_list, options, run_metadata)
File "/home/hl/anaconda3/envs/dl-conda-py36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1035, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Dst tensor is not initialized.
[[Node: _recv_inputs/input_placeholder_0/_7 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:2", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_3__recv_inputs/input_placeholder_0", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:2"]()]]
[[Node: Mean/_15 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:2", send_device_incarnation=1, tensor_name="edge_414_Mean", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
By default, TensorFlow pre-allocates the whole memory of the GPU devices it has access to. Therefore no memory is available for the second process.
You can control this allocation using the config.gpu_options:
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.4
sess = tf.Session(config=config)
Or you could assign a different card to each of your two processes by using os.environ["CUDA_VISIBLE_DEVICES"].
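A minimal sketch of that second option follows (the graph-building body is a placeholder for your own code); the key point is that CUDA_VISIBLE_DEVICES is set before TensorFlow initializes the GPUs, so each process only ever sees its own card:
import os

# Process A uses physical GPU 0; process B would set "1" instead.
# You can also set it from the shell: CUDA_VISIBLE_DEVICES=1 python train.py
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import tensorflow as tf

# Inside this process the selected card is now visible as '/gpu:0'.
with tf.device('/gpu:0'):
    pass  # build this process's placeholders, conv net, cost and optimizer here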
I have got a TensorFlow tensor within Keras (<class 'tensorflow.python.framework.ops.Tensor'>) which needs to be transformed into a numpy.ndarray.
Is there an easy solution to this task?
I did time-consuming research via Google and trial and error.
It seems that I have to use TensorFlow sessions, but I don't want to use any TensorFlow sessions, because when using them I receive errors like:
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating
TensorFlow device (/gpu:0) -> (device: 0, name: TITAN X (Pascal), pci bus id: 0000:09:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:1) -> (device: 1, name: TITAN X (Pascal), pci bus id: 0000:05:00.0)
Traceback (most recent call last):
File "mnist_cnn_model_api.py", line 348, in <module>
validation_data=(X_test, Y_test))
File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1596, in fit_generator
callbacks.on_epoch_end(epoch, epoch_logs)
File "/usr/local/lib/python2.7/dist-packages/keras/callbacks.py", line 76, in on_epoch_end
callback.on_epoch_end(epoch, logs)
File "/home/dominik/keras/mnist/my_functions.py", line 737, in on_epoch_end
current_ist = current_is.eval()
File "/home/dominik/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 575, in eval
return _eval_using_default_session(self, feed_dict, self.graph, session)
File "/home/dominik/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3633, in _eval_using_default_session
return session.run(tensors, feed_dict)
File "/home/dominik/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 766, in run
run_metadata_ptr)
File "/home/dominik/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 964, in _run
feed_dict_string, options, run_metadata)
File "/home/dominik/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1014, in _do_run
target_list, options, run_metadata)
File "/home/dominik/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1034, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: You must feed a value for placeholder tensor 'input_1' with dtype float
[[Node: input_1 = Placeholder[dtype=DT_FLOAT, shape=[], _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
[[Node: strided_slice_1/_1 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_6_strided_slice_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Caused by op u'input_1', defined at:
File "mnist_cnn_model_api.py", line 113, in <module>
inputs = Input(shape=input_shape)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 1198, in Input
input_tensor=tensor)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 1116, in __init__
name=self.name)
File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 309, in placeholder
x = tf.placeholder(dtype, shape=shape, name=name)
File "/home/dominik/.local/lib/python2.7/site-packages/tensorflow/python/ops/array_ops.py", line 1587, in placeholder
name=name)
File "/home/dominik/.local/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 2043, in _placeholder
name=name)
File "/home/dominik/.local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 759, in apply_op
op_def=op_def)
File "/home/dominik/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2240, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/home/dominik/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1128, in __init__
self._traceback = _extract_stack()
InvalidArgumentError (see above for traceback): You must feed a value for placeholder tensor 'input_1' with dtype float
[[Node: input_1 = Placeholder[dtype=DT_FLOAT, shape=[], _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
[[Node: strided_slice_1/_1 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_6_strided_slice_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
I need this placeholder in line 113; it doesn't cause trouble when using Keras without any TensorFlow session. But as soon as I try to use a TensorFlow session (in order to use .eval() for transforming the tensor into a numpy.ndarray), my code blows up...
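For reference, the session-based pattern I am trying to avoid would look roughly like this (a sketch assuming the Keras TensorFlow backend; model and x_batch stand in for my actual model and input batch, and current_is is the tensor from the traceback above):
from keras import backend as K

sess = K.get_session()  # reuse the session Keras already created

# Evaluating the tensor re-runs part of the graph, so the Input placeholder it
# depends on has to be fed explicitly; without the feed I get the
# "You must feed a value for placeholder tensor 'input_1'" error shown above.
value = sess.run(current_is, feed_dict={model.input: x_batch})
# value is now a numpy.ndarray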
I appreciate your help!