I want to train and evaluate ssd_mobilenet_v1_coco on my own dataset at the same time using the Object Detection API.
However, when I simply try to do so, the GPU memory is nearly full and the evaluation script fails to start.
Here are the commands I use for training and then evaluation:
The training script is called in one terminal pane like this:
python3 train.py \
--logtostderr \
--train_dir=training_ssd_mobile_caltech \
--pipeline_config_path=ssd_mobilenet_v1_coco_2017_11_17/ssd_mobilenet_v1_focal_loss_coco.config
That runs fine and training works. Then I try to run the evaluation script in a second terminal pane:
python3 eval.py \
--logtostderr \
--checkpoint_dir=training_ssd_mobile_caltech \
--eval_dir=eval_caltech \
--pipeline_config_path=ssd_mobilenet_v1_coco_2017_11_17/ssd_mobilenet_v1_focal_loss_coco.config
It fails with the following error:
python3 eval.py \
--logtostderr \
--checkpoint_dir=training_ssd_mobile_caltech \
--eval_dir=eval_caltech \
--pipeline_config_path=ssd_mobilenet_v1_coco_2017_11_17/ssd_mobilenet_v1_focal_loss_coco.config
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
2018-02-28 18:40:00.302271: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-02-28 18:40:00.412808: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:895] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-02-28 18:40:00.413217: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.835
pciBusID: 0000:01:00.0
totalMemory: 7.92GiB freeMemory: 93.00MiB
2018-02-28 18:40:00.413424: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1)
2018-02-28 18:40:00.957090: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 43.00M (45088768 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2018-02-28 18:40:00.957919: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 38.70M (40580096 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
INFO:tensorflow:Restoring parameters from training_ssd_mobile_caltech/model.ckpt-4775
INFO:tensorflow:Restoring parameters from training_ssd_mobile_caltech/model.ckpt-4775
2018-02-28 18:40:02.274830: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 8.17M (8566528 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2018-02-28 18:40:02.278599: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 8.17M (8566528 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2018-02-28 18:40:12.280515: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 8.17M (8566528 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2018-02-28 18:40:12.281958: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 8.17M (8566528 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2018-02-28 18:40:12.282082: W tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.75MiB. Current allocation summary follows.
2018-02-28 18:40:12.282160: I tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (256): Total Chunks: 190, Chunks in use: 190. 47.5KiB allocated for chunks. 47.5KiB in use in bin. 11.8KiB client-requested in use in bin.
2018-02-28 18:40:12.282251: I tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (512): Total Chunks: 70, Chunks in use: 70. 35.0KiB allocated for chunks. 35.0KiB in use in bin. 35.0KiB client-requested in use in bin.
[.......................................]2018-02-28 18:40:12.290959: I tensorflow/core/common_runtime/bfc_allocator.cc:684] Sum Total of in-use chunks: 29.83MiB
2018-02-28 18:40:12.290971: I tensorflow/core/common_runtime/bfc_allocator.cc:686] Stats:
Limit: 45088768
InUse: 31284736
MaxInUse: 32368384
NumAllocs: 808
MaxAllocSize: 5796864
2018-02-28 18:40:12.291022: W tensorflow/core/common_runtime/bfc_allocator.cc:277] **********************xx*********xx**_*__****______***********************************************xx
2018-02-28 18:40:12.291044: W tensorflow/core/framework/op_kernel.cc:1198] Resource exhausted: OOM when allocating tensor with shape[1,32,150,150] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
WARNING:root:The following classes have no ground truth examples: 1
/home/mm/models/research/object_detection/utils/metrics.py:144: RuntimeWarning: invalid value encountered in true_divide
num_images_correctly_detected_per_class / num_gt_imgs_per_class)
/home/mm/models/research/object_detection/utils/object_detection_evaluation.py:710: RuntimeWarning: Mean of empty slice
mean_ap = np.nanmean(self.average_precision_per_class)
/home/mm/models/research/object_detection/utils/object_detection_evaluation.py:711: RuntimeWarning: Mean of empty slice
mean_corloc = np.nanmean(self.corloc_per_class)
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1350, in _do_call
return fn(*args)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1329, in _run_fn
status, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1,32,150,150] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_0/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 2, 2, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Preprocessor/sub, FeatureExtractor/MobilenetV1/Conv2d_0/weights/read)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[Node: Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/ClipToWindow/Gather/Gather_1/_469 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1068_Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/ClipToWindow/Gather/Gather_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "eval.py", line 146, in <module>
tf.app.run()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 124, in run
_sys.exit(main(argv))
File "eval.py", line 142, in main
FLAGS.checkpoint_dir, FLAGS.eval_dir)
File "/home/mm/models/research/object_detection/evaluator.py", line 240, in evaluate
save_graph_dir=(eval_dir if eval_config.save_graph else ''))
File "/home/mm/models/research/object_detection/eval_util.py", line 407, in repeated_checkpoint_run
save_graph_dir)
File "/home/mm/models/research/object_detection/eval_util.py", line 286, in _run_checkpoint_once
result_dict = batch_processor(tensor_dict, sess, batch, counters)
File "/home/mm/models/research/object_detection/evaluator.py", line 183, in _process_batch
result_dict = sess.run(tensor_dict)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 895, in run
run_metadata_ptr)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1128, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1344, in _do_run
options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1363, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1,32,150,150] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_0/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 2, 2, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Preprocessor/sub, FeatureExtractor/MobilenetV1/Conv2d_0/weights/read)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[Node: Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/ClipToWindow/Gather/Gather_1/_469 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1068_Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/ClipToWindow/Gather/Gather_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Caused by op 'FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_0/Conv2D', defined at:
File "eval.py", line 146, in <module>
tf.app.run()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 124, in run
_sys.exit(main(argv))
File "eval.py", line 142, in main
FLAGS.checkpoint_dir, FLAGS.eval_dir)
File "/home/mm/models/research/object_detection/evaluator.py", line 161, in evaluate
ignore_groundtruth=eval_config.ignore_groundtruth)
File "/home/mm/models/research/object_detection/evaluator.py", line 72, in _extract_prediction_tensors
prediction_dict = model.predict(preprocessed_image, true_image_shapes)
File "/home/mm/models/research/object_detection/meta_architectures/ssd_meta_arch.py", line 334, in predict
preprocessed_inputs)
File "/home/mm/models/research/object_detection/models/ssd_mobilenet_v1_feature_extractor.py", line 112, in extract_features
scope=scope)
File "/home/mm/models/research/slim/nets/mobilenet_v1.py", line 232, in mobilenet_v1_base
scope=end_point)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 182, in func_with_args
return func(*args, **current_args)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1057, in convolution
outputs = layer.apply(inputs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/layers/base.py", line 762, in apply
return self.__call__(inputs, *args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/layers/base.py", line 652, in __call__
outputs = self.call(inputs, *args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/layers/convolutional.py", line 167, in call
outputs = self._convolution_op(inputs, self.kernel)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/nn_ops.py", line 838, in __call__
return self.conv_op(inp, filter)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/nn_ops.py", line 502, in __call__
return self.call(inp, filter)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/nn_ops.py", line 190, in __call__
name=self.name)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 639, in conv2d
data_format=data_format, dilations=dilations, name=name)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3160, in create_op
op_def=op_def)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1625, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[1,32,150,150] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_0/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 2, 2, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Preprocessor/sub, FeatureExtractor/MobilenetV1/Conv2d_0/weights/read)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[Node: Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/ClipToWindow/Gather/Gather_1/_469 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1068_Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/ClipToWindow/Gather/Gather_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Before eval.py is started, the TF training process has already allocated all of the GPU memory, so I can't figure out how to have both of them running at the same time, or at least have the Object Detection API run evaluation at specific intervals.
So, is it possible in the first place to have evaluation run simultaneously with training? If so, how is it done?
System information
What is the top-level directory of the model you are using: object_detection
Have I written custom code: not yet...
OS Platform and Distribution: Linux Ubuntu 16.04 LTS
TensorFlow installed from (source or binary): pip3 tensorflow-gpu
TensorFlow version (use command below): 1.5.0
CUDA/cuDNN version: 9.0/7.0
GPU model and memory: GTX 1080, 8 GB
One simple way to do this is to add CUDA_VISIBLE_DEVICES before your command:
CUDA_VISIBLE_DEVICES="" python eval.py --logtostderr --pipeline_config_path=multires.config --checkpoint_dir=/train_dir/ --eval_dir=eval_dir/
which will prevent your evaluation script from seeing any GPU, and it should fall back to CPU automatically.
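If you'd rather not prefix the shell command, the same effect can be achieved from inside the evaluation script, provided the environment variable is set before TensorFlow is imported. A minimal sketch (the final print is just a sanity check and is not part of the original eval.py):

import os

# Hide every GPU from this process; this must happen before TensorFlow
# is imported so the CUDA runtime never claims a device.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

import tensorflow as tf

# Sanity check: tf.test.gpu_device_name() returns "" when no GPU is visible.
print(tf.test.gpu_device_name() or "no GPU visible - running on CPU")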
To force the eval job to run on your CPU (and prevent it from taking precious GPU memory), create one virtualenv where you install tensorflow-gpu, which you use for training (named e.g. virtual_tf_gpu), and another one where you install tensorflow WITHOUT GPU support (e.g. virtual_tf). Activate the two virtualenvs in two separate terminal windows, then start training in your GPU-enabled environment and evaluation in your CPU-only environment.
Good luck!!!
Related
I am using YOLOv3 by Ultralytics (PyTorch) to detect the behavior of cows in a video. The YOLOv3 model was trained to detect each individual cow in the video. Each image of a cow is cropped using the X and Y coordinates of its bounding box. Each cropped image then goes through another model to determine whether the cow is sitting or standing. The second model was also trained on our own dataset; it uses TensorFlow and is a very simple InceptionV3 model.
However, whenever I try to load both models, I get the following error:
RuntimeError: CUDA out of memory. Tried to allocate 50.00 MiB (GPU 0; 16.00 GiB total capacity; 427.42 MiB already allocated; 7.50 MiB free; 448.00 MiB reserved in total by PyTorch)
If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
If the second model is not loaded, YOLOv3 (PyTorch) runs without any issue and does not even use the whole 16 GB of VRAM. Is YOLOv3 reserving the whole VRAM and not leaving anything for the TensorFlow-based InceptionV3? If yes, is there any way of forcing torch to keep 2 GB of VRAM aside?
Full code output here
>> python detectv2.py --weights best.pt --source outch06_20181022073801_0_10.avi
2022-06-01 16:02:40.975544: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-06-01 16:02:41.342394: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14123 MB memory: -> device: 0, name: Quadro RTX 5000, pci bus id: 0000:65:00.0, compute capability: 7.5
detectv2: weights=['best.pt'], source=outch06_20181022073801_0_10.avi, imgsz=[640, 640], conf_thres=0.25, iou_thres=0.45, max_det=1000, device=, view_img=False, save_txt=False, save_conf=False, save_crop=False, nosave=False, classes=None, agnostic_nms=False, augment=False, visualize=False, update=False, project=runs\detect, name=exp, exist_ok=False, line_thickness=3, hide_labels=False, hide_conf=False, half=False, dnn=False
Empty DataFrame
Columns: []
Index: []
YOLOv3 2022-5-16 torch 1.11.0 CUDA:0 (Quadro RTX 5000, 16384MiB)
Fusing layers...
Model Summary: 269 layers, 62546518 parameters, 0 gradients
Traceback (most recent call last):
File "detectv2.py", line 462, in <module>
main(opt)
File "detectv2.py", line 457, in main
run(**vars(opt))
File "C:\Users\sourav\Anaconda3\envs\yl37\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "detectv2.py", line 221, in run
model(torch.zeros(1, 3, *imgsz).to(device).type_as(next(model.model.parameters()))) # warmup
File "C:\Users\sourav\Anaconda3\envs\yl37\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\sourav\yolov3-master\models\common.py", line 357, in forward
y = self.model(im) if self.jit else self.model(im, augment=augment, visualize=visualize)
File "C:\Users\sourav\Anaconda3\envs\yl37\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\sourav\yolov3-master\models\yolo.py", line 127, in forward
return self._forward_once(x, profile, visualize) # single-scale inference, train
File "C:\Users\sourav\yolov3-master\models\yolo.py", line 150, in _forward_once
x = m(x) # run
File "C:\Users\sourav\Anaconda3\envs\yl37\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\sourav\yolov3-master\models\common.py", line 48, in forward_fuse
return self.act(self.conv(x))
File "C:\Users\sourav\Anaconda3\envs\yl37\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\sourav\Anaconda3\envs\yl37\lib\site-packages\torch\nn\modules\conv.py", line 447, in forward
return self._conv_forward(input, self.weight, self.bias)
File "C:\Users\sourav\Anaconda3\envs\yl37\lib\site-packages\torch\nn\modules\conv.py", line 444, in _conv_forward
self.padding, self.dilation, self.groups)
RuntimeError: CUDA out of memory. Tried to allocate 50.00 MiB (GPU 0; 16.00 GiB total capacity; 427.42 MiB already allocated; 7.50 MiB free; 448.00 MiB reserved in total by PyTorch)
If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
PyTorch did not occupy the GPU as I thought. It was the other way around: I was initializing the TensorFlow model first, which occupied the whole memory and did not leave anything for PyTorch.
The solution is here:
For TensorFlow 2.2+ with a single GPU:
import tensorflow as tf
gpu = tf.config.experimental.list_physical_devices('GPU')[0]
tf.config.experimental.set_memory_growth(gpu, True)
Details can be found in this post and this documentation.
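For completeness, a hedged variant of the same idea that enables memory growth on every visible GPU rather than just the first one (same experimental API, still assuming TensorFlow 2.2+):

import tensorflow as tf

# Must run before any GPU has been initialized, otherwise TensorFlow
# raises a RuntimeError. Memory is then allocated on demand instead of
# being grabbed up front, so another framework can coexist on the GPU.
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
print("memory growth enabled on %d GPU(s)" % len(gpus))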
I'm trying to restore the MobileNet V2 model with TensorFlow 1.7.0 from this link, using the following code, but I am getting an error.
import tensorflow as tf
dir(tf.contrib)
tf.reset_default_graph()
v1 = tf.get_variable("v1", shape=[3])
v2 = tf.get_variable("v2", shape=[5])
saver = tf.train.Saver()
with tf.Session() as sess:
    saver = tf.train.import_meta_graph("/mobilenet_v2_1.4_224.ckpt.meta")
    saver.restore(sess, "/mobilenet_v2_1.4_224.ckpt.data-00000-of-00001")
I am facing the following error, which is related to TPU, whereas I only have GPU support:
Traceback (most recent call last):
File "/home/ext_user1/tensorflow_1.2.1_cp34/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1327, in _do_call
return fn(*args)
File "/home/ext_user1/tensorflow_1.2.1_cp34/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1310, in _run_fn
self._extend_graph()
File "/home/ext_user1/tensorflow_1.2.1_cp34/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1358, in _extend_graph
graph_def.SerializeToString(), status)
File "/home/ext_user1/tensorflow_1.2.1_cp34/lib/python3.4/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'ShutdownDistributedTPU' with these attrs. Registered devices: [CPU], Registered kernels:
[[Node: ShutdownDistributedTPU = ShutdownDistributedTPU[_device="/job:tpu_worker/device:TPU_SYSTEM:0"]()]]
Please help me.
The fix for this is to clear the preset devices from the metagraph:
saver = tf.train.import_meta_graph("/mobilenet_v2_1.4_224.ckpt.meta", clear_devices=True)
The metagraph is used for restoring a training session from a checkpoint. For prediction from this checkpoint, the metagraph isn't needed. However, if you want to keep training the model, then importing the metagraph with devices cleared is the way to go.
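Putting the pieces together, here is a minimal TF 1.x sketch. It assumes the checkpoint files sit next to each other and that saver.restore() is given the checkpoint prefix (the path without the .data-00000-of-00001 suffix):

import tensorflow as tf

ckpt_prefix = "/mobilenet_v2_1.4_224.ckpt"  # prefix only, not the .data-* shard

tf.reset_default_graph()
with tf.Session() as sess:
    # clear_devices=True strips the TPU device placements baked into the
    # metagraph so the ops can be placed on the available CPU/GPU.
    saver = tf.train.import_meta_graph(ckpt_prefix + ".meta", clear_devices=True)
    saver.restore(sess, ckpt_prefix)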
I'm running a seq2seq model with TensorFlow. The inference program runs well when loading parameters from a checkpoint file using tf.train.Saver, but after exporting the graph with freeze_graph.py (using tf.framework.graph_util.convert_variables_to_constants()) and importing it with tf.import_graph_def in the inference program, it hits an OOM problem.
Here is part of the error log:
W tensorflow/core/common_runtime/bfc_allocator.cc:274] ****************************************************************************************************
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 4.0KiB. See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:983] Internal: Dst tensor is not initialized.
E tensorflow/core/common_runtime/executor.cc:594] Executor failed to create kernel. Internal: Dst tensor is not initialized.
[[Node: embedding_attention_seq2seq/embedding_attention_decoder/attention_decoder/AttnV_0 = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [1024] values: -0.016628871 -0.2054652 -0.045054652...>, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
Traceback (most recent call last):
File "inference.py", line 88, in console_main
result = list(inference(source_sentence))
File "inference.py", line 54, in inference
for sequence in result:
File "/data/experiment/decoder.py", line 115, in search_best_sequence
State.batch_predict(self.session, self.model, self.context, beam)
File "/data/experiment/decoder.py", line 82, in batch_predict
state_list[0].depth)
File "/data/experiment/seq2seq_model.py", line 452, in batch_feed_decoder
log_softmax, attns, state = session.run(output_fetch, input_feed)
File "/home/.conda/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 767, in run
run_metadata_ptr)
File "/home/.conda/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 966, in _run
feed_dict_string, options, run_metadata)
File "/home/.conda/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1016, in _do_run
target_list, options, run_metadata)
File "/home/.conda/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1036, in _do_call
raise type(e)(node_def, op, message)
InternalError: Dst tensor is not initialized.
[[Node: embedding_attention_seq2seq/embedding_attention_decoder/attention_decoder/AttnV_0 = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [1024] values: -0.016628871 -0.2054652 -0.045054652...>, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
Caused by op u'embedding_attention_seq2seq/embedding_attention_decoder/attention_decoder/AttnV_0', defined at:
File "inference.py", line 169, in <module>
tf.app.run()
File "/home/.conda/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 44, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "inference.py", line 165, in main
console_main(session)
File "inference.py", line 66, in console_main
model = create_model(session, False)
File "/data/experiment/model.py", line 145, in create_model
tensor_name_pickle=tensor_name_pickle)
File "/data/experiment/seq2seq_model.py", line 106, in __init__
tf.import_graph_def(graph_def, name="")
File "/home/.conda/lib/python2.7/site-packages/tensorflow/python/framework/importer.py", line 287, in import_graph_def
op_def=op_def)
File "/home/.conda/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2395, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/home/.conda/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1264, in __init__
self._traceback = _extract_stack()
InternalError (see above for traceback): Dst tensor is not initialized.
[[Node: embedding_attention_seq2seq/embedding_attention_decoder/attention_decoder/AttnV_0 = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [1024] values: -0.016628871 -0.2054652 -0.045054652...>, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
I thought it might be caused by a memory issue with tf.Constant. Does anyone have experience with this problem?
I had the same issue, but when trying to load and run the inference from a C++ application using the C API. After a lot of twiddling and testing, it appeared the culprit was the frozen graph and freeze_graph.py itself. It's probably a bug of some kind. There are actually multiple issue reports on GitHub's TF repo, but they were just closed due to lack of activity, e.g. here and here. I guess apparent bugs in model freezing aren't a priority.
In my case the model's .pb file was around 500 MB and it took around 10 GB of RAM while running a session. Not only did it occupy an insane amount of RAM, it was also orders of magnitude slower that way.
When I switched to loading just a SavedModel directory, everything went back to normal. I'm not sure how to achieve that in Python, but for C code I replaced a TF_GraphImportGraphDef() call with TF_LoadSessionFromSavedModel().
I used TF v1.14.0. The library was built with Bazel by me, not the stock version. I could provide some details here and there if anybody is interested; I'm just not sure where to start, as I had many trials and errors.
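Regarding the Python side mentioned above, the usual TF 1.x way to load a SavedModel directory is the SavedModel loader. A rough sketch, assuming the model was exported with the standard "serve" tag and with a hypothetical path and tensor names:

import tensorflow as tf

export_dir = "/path/to/saved_model"  # hypothetical SavedModel directory

with tf.Session(graph=tf.Graph()) as sess:
    # Loads the graph definition and variables saved under the "serve" tag.
    tf.saved_model.loader.load(
        sess, [tf.saved_model.tag_constants.SERVING], export_dir)
    # From here, fetch tensors by name and run inference as usual, e.g.:
    # sess.run("output:0", feed_dict={"input:0": data})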
I'm testing Google Cloud ML to speed up my ML model using TensorFlow.
Unfortunately, Google Cloud ML seems to be extremely slow. My mainstream-level PC is at least 10x faster than Google Cloud ML.
I doubted it was using a GPU, so I ran a test. I modified a sample code to force the use of the GPU.
diff --git a/mnist/trainable/trainer/task.py b/mnist/trainable/trainer/task.py
index 9acb349..a64a11d 100644
--- a/mnist/trainable/trainer/task.py
+++ b/mnist/trainable/trainer/task.py
@@ -131,11 +131,12 @@ def run_training():
images_placeholder, labels_placeholder = placeholder_inputs(
FLAGS.batch_size)
- # Build a Graph that computes predictions from the inference model.
- logits = mnist.inference(images_placeholder, FLAGS.hidden1, FLAGS.hidden2)
+ with tf.device("/gpu:0"):
+ # Build a Graph that computes predictions from the inference model.
+ logits = mnist.inference(images_placeholder, FLAGS.hidden1, FLAGS.hidden2)
- # Add to the Graph the Ops for loss calculation.
- loss = mnist.loss(logits, labels_placeholder)
+ # Add to the Graph the Ops for loss calculation.
+ loss = mnist.loss(logits, labels_placeholder)
# Add to the Graph the Ops that calculate and apply gradients.
train_op = mnist.training(loss, FLAGS.learning_rate)
This training code works on my PC (gcloud beta ml local train ...) but not in the cloud. It gives errors like this:
"Traceback (most recent call last):
File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 239, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 43, in run
sys.exit(main(sys.argv[:1] + flags_passthrough))
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 235, in main
run_training()
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 177, in run_training
sess.run(init)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 766, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 964, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1014, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1034, in _do_call
raise type(e)(node_def, op, message)
InvalidArgumentError: Cannot assign a device to node 'softmax_linear/biases': Could not satisfy explicit device specification '/device:GPU:0' because no devices matching that specification are registered in this process; available devices: /job:localhost/replica:0/task:0/cpu:0
Colocation Debug Info:
Colocation group had the following types and devices:
ApplyGradientDescent: CPU
Identity: CPU
Assign: CPU
Variable: CPU
[[Node: softmax_linear/biases = Variable[container="", dtype=DT_FLOAT, shape=[10], shared_name="", _device="/device:GPU:0"]()]]
Does Google Cloud ML support GPU?
GPUs are now in Beta and all Cloud ML customers have access.
Here are the docs for using GPUs with Cloud ML.
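As a quick sanity check, you can log which devices the job actually sees before forcing placement; a small sketch using standard TF 1.x APIs (nothing Cloud-ML-specific is assumed):

import tensorflow as tf
from tensorflow.python.client import device_lib

# On a CPU-only machine type this lists only the CPU device, which matches
# the "no devices matching that specification" error above.
print([d.name for d in device_lib.list_local_devices()])
print("GPU device: " + (tf.test.gpu_device_name() or "none"))

Alternatively, constructing the session with tf.ConfigProto(allow_soft_placement=True) lets TensorFlow fall back to the CPU when an explicitly requested device is unavailable.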
I'm studying TensorFlow and want to test the slim example. When I run ./scripts/train_lenet_on_mnist.sh, the program reaches eval_image_classifier and gives a TypeError. The error information is as follows:
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so.8.0 locally
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Evaluating /tmp/lenet-model/model.ckpt-20002
INFO:tensorflow:Starting evaluation at 2016-11-09-02:55:57
I tensorflow/core/common_runtime/gpu/gpu_device.cc:951] Found device 0 with properties:
name: Quadro K5000
major: 3 minor: 0 memoryClockRate (GHz) 0.7055
pciBusID 0000:03:00.0
Total memory: 3.94GiB
Free memory: 3.61GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:972] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Quadro K5000, pci bus id: 0000:03:00.0)
INFO:tensorflow:Executing eval ops
INFO:tensorflow:Executing eval_op 1/100
INFO:tensorflow:Error reported to Coordinator: <class 'TypeError'>, Fetch argument dict_values([<tf.Tensor 'accuracy/update_op:0' shape=() dtype=float32>, <tf.Tensor 'recall_at_5/update_op:0' shape=() dtype=float32>]) has invalid type <class 'dict_values'>, must be a string or Tensor. (Can not convert a dict_values into a Tensor or Operation.)
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 218, in init
fetch, allow_tensor=True, allow_operation=True))
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 2455, in as_graph_element
return self._as_graph_element_locked(obj, allow_tensor, allow_operation)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 2547, in _as_graph_element_locked
% (type(obj).__name__, types_str))
TypeError: Can not convert a dict_values into a Tensor or Operation.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "eval_image_classifier.py", line 191, in <module>
tf.app.run()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv[:1] + flags_passthrough))
File "eval_image_classifier.py", line 187, in main
variables_to_restore=variables_to_restore)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/evaluation.py", line 359, in evaluate_once
global_step=global_step)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/evaluation.py", line 260, in evaluation
sess.run(eval_op, eval_op_feed_dict)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 717, in run
run_metadata_ptr)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 902, in _run
fetch_handler = _FetchHandler(self._graph, fetches, feed_dict_string)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 358, in init
self._fetch_mapper = _FetchMapper.for_fetch(fetches)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 189, in for_fetch
return _ElementFetchMapper(fetches, contraction_fn)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 222, in init
% (fetch, type(fetch), str(e)))
TypeError: Fetch argument dict_values([<tf.Tensor 'accuracy/update_op:0' shape=() dtype=float32>, <tf.Tensor 'recall_at_5/update_op:0' shape=() dtype=float32>]) has invalid type <class 'dict_values'>, must be a string or Tensor. (Can not convert a dict_values into a Tensor or Operation.)
I don't know what happened to the program. I did not revise any code; I just downloaded the code package from GitHub, and the data download and training steps gave correct results. Can anyone help me? I am waiting online. Thanks.
The problem is compatibility between Python 2 and Python 3. I used Python 3 as the interpreter, and retrieving the keys or values of a dictionary behaves differently in Python 2 and Python 3.
In Python 2, simply calling keys() or values() on a dictionary object returns a list, as you would expect. In Python 3, however, they return view objects instead of lists, so the TypeError can be avoided, and compatibility maintained, by simply converting the dict_keys or dict_values object into a list, which can then be used as normal in both Python 2 and Python 3.
I edited eval_image_classifier.py to use eval_op=list(names_to_updates.values()), and then it works perfectly.
The other Python 3 change for eval_image_classifier.py is from
for name, value in names_to_values.iteritems():
to
for name, value in names_to_values.items():
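To illustrate the underlying difference, a minimal sketch using only the standard library (the dictionary contents are placeholder strings, not the real update ops):

# Python 3: dict.values() returns a view object, not a list.
names_to_updates = {"accuracy": "acc/update_op", "recall_at_5": "rec/update_op"}

fetches = names_to_updates.values()
print(type(fetches))            # <class 'dict_values'> -- rejected as a sess.run() fetch

fetches = list(names_to_updates.values())
print(type(fetches), fetches)   # <class 'list'> -- accepted by sess.run()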