I implemented a GAN in TensorFlow Estimator format. Here's the complete code in a gist.
The model trains normally. However, it seems to hang forever at model.evaluate. The log after training looks like this:
INFO:tensorflow:Starting evaluation at 2018-12-03-02:19:06
INFO:tensorflow:Graph was finalized.
2018-12-03 02:19:06.956750: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-12-03 02:19:06.956781: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-12-03 02:19:06.956786: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2018-12-03 02:19:06.956790: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2018-12-03 02:19:06.956912: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10464 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
INFO:tensorflow:Restoring parameters from /tensorlog/wad/acgan/a51fbd6/model.ckpt-10002
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
If I use tf.estimator.train_and_evaluate, the evaluated accuracy is always 0.5.
I've already checked my tfrecords file: it's not empty, and the images and labels can be read without problems. I also tried using the same tfrecords file for both training and evaluating, but still got the same result.
It seems to me that the model may have a problem loading the GAN's weights from checkpoints. If that's true, how do I solve it?
It turns out the training parameters in dropout and batch_normalization prevent the weights from being restored. Fixing the value of training to either True or False solves the problem.
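For reference, here is a minimal sketch of the fix in an Estimator model_fn (the layer names and sizes are placeholders, not the code from the gist): derive training from the mode as a plain Python bool, so the batch_norm/dropout graph is identical when the checkpoint is restored for evaluation.

import tensorflow as tf

def model_fn(features, labels, mode, params):
    # True only in TRAIN mode; a constant Python bool, not a tensor,
    # so EVAL builds inference-mode batch_norm/dropout and restores cleanly.
    is_training = (mode == tf.estimator.ModeKeys.TRAIN)

    net = tf.layers.dense(features['x'], 128, activation=tf.nn.relu)
    net = tf.layers.batch_normalization(net, training=is_training)
    net = tf.layers.dropout(net, rate=0.5, training=is_training)
    logits = tf.layers.dense(net, 1)

    loss = tf.losses.sigmoid_cross_entropy(labels, logits)
    if mode == tf.estimator.ModeKeys.TRAIN:
        # batch_normalization's moving averages update via UPDATE_OPS
        update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
        with tf.control_dependencies(update_ops):
            train_op = tf.train.AdamOptimizer().minimize(
                loss, global_step=tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)
    return tf.estimator.EstimatorSpec(mode, loss=loss)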
Related
It is my first time training a model on a GPU. I am using TensorFlow, and I am getting this error: InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run AssignVariableOp: Dst tensor is not initialized. [Op:AssignVariableOp]
I have tried solutions like reducing the batch size and using tf-nightly, but to no avail. I am using an Nvidia GeForce GTX 1080 with 8 GB, and I am trying to train an image classification model using a Keras Application (Xception).
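For what it's worth, "Dst tensor is not initialized" during AssignVariableOp is usually a symptom of the GPU running out of memory while copying weights over. A minimal sketch (assuming TF 1.x with the Keras backend; the classes value is just an example) of letting TensorFlow grow its allocation on demand before building Xception:

import tensorflow as tf
from keras.backend.tensorflow_backend import set_session
from keras.applications import Xception

# Allocate GPU memory on demand instead of grabbing (nearly) all 8 GB at
# startup, which often avoids "Dst tensor is not initialized" on copy.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
set_session(tf.Session(config=config))

model = Xception(weights=None, input_shape=(299, 299, 3), classes=10)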
Running AWS SageMaker with a custom model, the TrainingJob fails with an Algorithm Error when using Keras plus a TensorFlow backend in a multi-GPU configuration:
from keras.utils import multi_gpu_model

parallel_model = multi_gpu_model(model, gpus=K)
parallel_model.compile(loss='categorical_crossentropy',
                       optimizer='rmsprop')
parallel_model.fit(x, y, epochs=20, batch_size=256)
This simple parallel model loading fails. There is no further error or exception in the CloudWatch logs. This configuration works properly on a local machine with 2x NVIDIA GTX 1080 and the same Keras/TensorFlow backend.
According to the SageMaker documentation and tutorials, the multi_gpu_model utility works fine when the Keras backend is MXNet, but I did not find any mention of the case where the backend is TensorFlow with the same multi-GPU configuration.
[UPDATE]
I have updated the code with the suggested answer below, and I'm adding some logging from before the TrainingJob hangs.
This logging repeats twice:
2018-11-27 10:02:49.878414: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2, 3
2018-11-27 10:02:49.878462: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-27 10:02:49.878471: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1 2 3
2018-11-27 10:02:49.878477: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N Y Y Y
2018-11-27 10:02:49.878481: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: Y N Y Y
2018-11-27 10:02:49.878486: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2: Y Y N Y
2018-11-27 10:02:49.878492: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 3: Y Y Y N
2018-11-27 10:02:49.879340: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:0 with 14874 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1b.0, compute capability: 7.0)
2018-11-27 10:02:49.879486: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:1 with 14874 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1c.0, compute capability: 7.0)
2018-11-27 10:02:49.879694: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:2 with 14874 MB memory) -> physical GPU (device: 2, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1d.0, compute capability: 7.0)
2018-11-27 10:02:49.879872: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:3 with 14874 MB memory) -> physical GPU (device: 3, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)
Before that, there is some logging info about each GPU, which repeats 4 times:
2018-11-27 10:02:46.447639: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 3 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:1e.0
totalMemory: 15.78GiB freeMemory: 15.37GiB
According to the logging, all 4 GPUs are visible and loaded in the TensorFlow Keras backend. After that, no application logging follows; the TrainingJob status is InProgress for a while, after which it becomes Failed with the same Algorithm Error.
Looking at the CloudWatch logs I can see some metrics at work. Specifically, GPU Memory Utilization and CPU Utilization are OK, while GPU Utilization is 0%.
[UPDATE]
Due to a known Keras bug about saving a multi-GPU model, I'm using this override of the multi_gpu_model utility from keras.utils:
from keras.layers import Lambda, concatenate
from keras import Model
import tensorflow as tf

def multi_gpu_model(model, gpus):
    # source: https://github.com/keras-team/keras/issues/8123#issuecomment-354857044
    if isinstance(gpus, (list, tuple)):
        num_gpus = len(gpus)
        target_gpu_ids = gpus
    else:
        num_gpus = gpus
        target_gpu_ids = range(num_gpus)

    def get_slice(data, i, parts):
        shape = tf.shape(data)
        batch_size = shape[:1]
        input_shape = shape[1:]
        step = batch_size // parts
        if i == num_gpus - 1:
            size = batch_size - step * i
        else:
            size = step
        size = tf.concat([size, input_shape], axis=0)
        stride = tf.concat([step, input_shape * 0], axis=0)
        start = stride * i
        return tf.slice(data, start, size)

    all_outputs = []
    for i in range(len(model.outputs)):
        all_outputs.append([])

    # Place a copy of the model on each GPU,
    # each getting a slice of the inputs.
    for i, gpu_id in enumerate(target_gpu_ids):
        with tf.device('/gpu:%d' % gpu_id):
            with tf.name_scope('replica_%d' % gpu_id):
                inputs = []
                # Retrieve a slice of the input.
                for x in model.inputs:
                    input_shape = tuple(x.get_shape().as_list())[1:]
                    slice_i = Lambda(get_slice,
                                     output_shape=input_shape,
                                     arguments={'i': i,
                                                'parts': num_gpus})(x)
                    inputs.append(slice_i)

                # Apply model on slice
                # (creating a model replica on the target device).
                outputs = model(inputs)
                if not isinstance(outputs, list):
                    outputs = [outputs]

                # Save the outputs for merging back together later.
                for o in range(len(outputs)):
                    all_outputs[o].append(outputs[o])

    # Merge outputs on CPU.
    with tf.device('/cpu:0'):
        merged = []
        for name, outputs in zip(model.output_names, all_outputs):
            merged.append(concatenate(outputs,
                                      axis=0, name=name))
        return Model(model.inputs, merged)
This works OK on a local 2x NVIDIA GTX 1080 / Intel Xeon / Ubuntu 16.04 machine. It fails on the SageMaker TrainingJob.
I have posted this issue on the AWS SageMaker forum in:
TrainingJob custom algorithm with Keras backend and multi GPU
SageMaker Fails when using Multi-GPU with keras.utils.multi_gpu_model
[UPDATE]
I have slightly modified the tf.Session code, adding some initializers:
with tf.Session() as session:
    K.set_session(session)
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
and now at least I can see from the instance metrics that one GPU (I assume device gpu:0) is used. The multi-GPU setup still does not work.
This might not be the best answer for your problem, but this is what I am using for a multi-GPU model with the TensorFlow backend. First, I initialize using:
def setup_multi_gpus():
    """
    Setup multi GPU usage

    Example usage:
    model = Sequential()
    ...
    multi_model = multi_gpu_model(model, gpus=num_gpu)
    multi_model.fit()

    About memory usage:
    https://stackoverflow.com/questions/34199233/how-to-prevent-tensorflow-from-allocating-the-totality-of-a-gpu-memory
    """
    import tensorflow as tf
    from keras.utils.training_utils import multi_gpu_model
    from tensorflow.python.client import device_lib

    # IMPORTANT: Tells tf to not occupy a specific amount of memory
    from keras.backend.tensorflow_backend import set_session
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True  # dynamically grow the memory used on the GPU
    sess = tf.Session(config=config)
    set_session(sess)  # set this TensorFlow session as the default session for Keras

    # getting the number of GPUs
    def get_available_gpus():
        local_device_protos = device_lib.list_local_devices()
        return [x.name for x in local_device_protos if x.device_type == 'GPU']

    num_gpu = len(get_available_gpus())
    print('Amount of GPUs available: %s' % num_gpu)
    return num_gpu
Then I call:
# Setup multi GPU usage
num_gpu = setup_multi_gpus()
and create a model.
...
After that, you're able to make it a multi-GPU model:
multi_model = multi_gpu_model(model, gpus=num_gpu)
multi_model.compile...
multi_model.fit...
The only thing here that differs from what you are doing is the way TensorFlow initializes the GPUs. I can't imagine it being the problem, but it might be worth trying out.
Good luck!
Edit: I noticed sequence-to-sequence models not being able to work with multiple GPUs. Is that the type of model you are trying to train?
I apologize for the slow response.
It seems there are a lot of threads running in parallel, and I want to link them together so that other individuals who have the same issue can see the progress and discussion.
https://forums.aws.amazon.com/thread.jspa?messageID=881541
https://forums.aws.amazon.com/thread.jspa?messageID=881540
https://github.com/aws/sagemaker-python-sdk/issues/512
There are a few questions in regard to this.
What version of TensorFlow and Keras?
I am not too sure what is causing this problem. Does your container have all of the needed dependencies, such as CUDA? https://www.tensorflow.org/install/gpu
Were you able to train using single GPU with Keras?
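On the last point, a minimal single-GPU sanity check (random data, hypothetical shapes) that should push GPU utilization above 0% if the container's CUDA stack is healthy:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Tiny throwaway regression model: if this trains, single-GPU Keras works.
x = np.random.rand(256, 32).astype('float32')
y = np.random.rand(256, 1).astype('float32')

model = Sequential([Dense(16, activation='relu', input_shape=(32,)),
                    Dense(1)])
model.compile(loss='mse', optimizer='rmsprop')
model.fit(x, y, epochs=2, batch_size=64)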
python model_main.py --model_dir=training/ --pipeline_config_path=training/faster_rcnn_inception_v2_pets.config
WARNING:tensorflow:Estimator's model_fn (<function create_model_fn.<locals>.model_fn at 0x00000155CFDB5C80>) includes params argument, but params are not passed to Estimator.
WARNING:tensorflow:num_readers has been reduced to 1 to match input file shards.
WARNING:root:Variable [SecondStageBoxPredictor/BoxEncodingPredictor/biases] is available in checkpoint, but has an incompatible shape with model variable.
WARNING:root:Variable [SecondStageBoxPredictor/BoxEncodingPredictor/weights] is available in checkpoint, but has an incompatible shape with model variable.
WARNING:root:Variable [SecondStageBoxPredictor/ClassPredictor/biases] is available in checkpoint, but has an incompatible shape with model variable.
WARNING:root:Variable [SecondStageBoxPredictor/ClassPredictor/weights] is available in checkpoint, but has an incompatible shape with model variable.
WARNING:root:Variable [global_step] is not available in checkpoint
D:\ProgramFiles\Anaconda\envs\Eneger\lib\site-packages\tensorflow\python\ops\gradients_impl.py:100: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
2018-07-24 05:19:44.209751: I C:\users\nwani\_bazel_nwani\mmtm6wb6\execroot\org_tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2018-07-24 05:19:45.534609: I C:\users\nwani\_bazel_nwani\mmtm6wb6\execroot\org_tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1356] Found device 0 with properties:
name: GeForce GTX 1060 major: 6 minor: 1 memoryClockRate(GHz): 1.6705
pciBusID: 0000:01:00.0
totalMemory: 6.00GiB freeMemory: 4.97GiB
2018-07-24 05:19:45.571588: I C:\users\nwani\_bazel_nwani\mmtm6wb6\execroot\org_tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1435] Adding visible gpu devices: 0
2018-07-24 05:19:53.604693: I C:\users\nwani\_bazel_nwani\mmtm6wb6\execroot\org_tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-07-24 05:19:53.615439: I C:\users\nwani\_bazel_nwani\mmtm6wb6\execroot\org_tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:929] 0
2018-07-24 05:19:53.622183: I C:\users\nwani\_bazel_nwani\mmtm6wb6\execroot\org_tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:942] 0: N
2018-07-24 05:19:53.643331: I C:\users\nwani\_bazel_nwani\mmtm6wb6\execroot\org_tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4730 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060, pci bus id: 0000:01:00.0, compute capability: 6.1)
2018-07-24 05:20:58.481169: E C:\users\nwani\_bazel_nwani\mmtm6wb6\execroot\org_tensorflow\tensorflow\stream_executor\cuda\cuda_event.cc:49] Error polling for event status: failed to query event: **CUDA_ERROR_ILLEGAL_INSTRUCTION**
2018-07-24 05:20:58.511921: F C:\users\nwani\_bazel_nwani\mmtm6wb6\execroot\org_tensorflow\tensorflow\core\common_runtime\gpu\gpu_event_mgr.cc:208] Unexpected Event status: 1
There has been an open issue on the TensorFlow GitHub since March:
https://github.com/tensorflow/tensorflow/issues/17747
But it seems the error appears randomly, not every time; can you confirm that? The last answer is from 10 days ago, so I think they're still working on it.
L-
I would first check that your versions of CUDA and cuDNN are the ones compatible with your TensorFlow version: for example, CUDA Toolkit 9.0 and not 9.1, or cuDNN 7.0 and not 7.0.5.
After that, I would update the GPU drivers from the official Nvidia website.
I would not only check that you have CUDA 9.0 and cuDNN 7.0, but I would also make sure that the cuDNN version you selected and installed is the one built for CUDA 9.0 (there are versions of cuDNN for CUDA 8.0, 9.0, and 9.1).
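A quick way to confirm what your installed TensorFlow build expects (a sketch for TF 1.x):

import tensorflow as tf

print(tf.__version__)                # e.g. 1.8.0 pairs with CUDA 9.0 / cuDNN 7.0
print(tf.test.is_built_with_cuda())  # True if this build was compiled against CUDA
print(tf.test.gpu_device_name())     # empty string if no GPU is actually usable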
I am running TensorFlow 1.7 on my local machine, which contains 2 GPUs with around 8 GB each.
Training (train.py) the object detection model works when I use 'faster_rcnn_resnet101_coco'. But when I try to run 'faster_rcnn_nas_coco', it shows a 'Resource exhausted' error:
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py:736: __init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2018-05-02 16:14:53.963966: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0, 1
2018-05-02 16:14:53.964071: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-05-02 16:14:53.964083: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917] 0 1
2018-05-02 16:14:53.964091: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0: N Y
2018-05-02 16:14:53.964097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 1: Y N
2018-05-02 16:14:53.964566: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7385 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1070, pci bus id: 0000:02:00.0, compute capability: 6.1)
2018-05-02 16:14:53.966360: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 7552 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1070, pci bus id: 0000:03:00.0, compute capability: 6.1)
INFO:tensorflow:Restoring parameters from training/model.ckpt-0
INFO:tensorflow:Restoring parameters from training/model.ckpt-0
Limit: 7744048333
InUse: 7699536896
MaxInUse: 7699551744
NumAllocs: 10260
MaxAllocSize: 4076716032
2018-05-02 16:16:52.223943: W tensorflow/core/common_runtime/bfc_allocator.cc:279] ***********************************************************************************x****************
2018-05-02 16:16:52.223967: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at depthwise_conv_op.cc:358 : Resource exhausted: OOM when allocating tensor with shape[64,672,9,9] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
I am not sure if it's using both GPUs, because it shows the memory in use as '7699536896'. After going through train.py, I also tried:
python train.py \
    --logtostderr \
    --worker_replicas=2 \
    --pipeline_config_path=training/faster_rcnn_resnet101_coco.config \
    --train_dir=training
If 2 GPUs are available, does TensorFlow choose both of them by default, or does it need any arguments?
We use the number of GPUs specified by worker_replicas. For the NASNet case, try decreasing the batch size to make the network fit on the GPU.
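For example, in the pipeline .config file (a sketch; every other train_config field stays unchanged), NAS-based models typically need a much smaller batch size than ResNet-101 on an 8 GB card:

train_config {
  batch_size: 1  # start at 1 for faster_rcnn_nas_coco; raise it only if it still fits
  # ... other train_config fields unchanged
}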
The system configuration is as follows:
Ubuntu 16.04,
cuda 9.1,
cudnn 7.0.5,
nvidia driver 390.30,
GTX 1050 TI gpu,
tensorflow-gpu 1.7rc1 and 1.5,
configuration file and train.py is stock from the tensorflow distribution, and
the trained model being used is ssd_mobilenet_v1_coco_2017_11_17.
The following is collected from the putty terminal session:
(od) gennis#AI:~/models/research$ ./train_raccoon.sh
WARNING:tensorflow:From /home/dennis/.virtualenvs/od/local/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/datasets/base.py:198: retry (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Use the retry module or similar alternatives.
WARNING:tensorflow:From /home/dennis/models/research/object_detection/trainer.py:228: create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.create_global_step
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:Summary name /clone_loss is illegal; using clone_loss instead.
WARNING:tensorflow:From /home/dennis/.virtualenvs/od/local/lib/python2.7/site-packages/tensorflow/contrib/slim/python/slim/learning.py:736: __init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2018-03-23 16:59:01.725435: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1355] Found device 0 with properties:
name: GeForce GTX 1050 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.455
pciBusID: 0000:01:00.0
totalMemory: 3.94GiB freeMemory: 3.89GiB
2018-03-23 16:59:01.725484: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1434] Adding visible gpu devices: 0
2018-03-23 16:59:02.090533: I tensorflow/core/common_runtime/gpu/gpu_device.cc:922] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-03-23 16:59:02.090592: I tensorflow/core/common_runtime/gpu/gpu_device.cc:928] 0
2018-03-23 16:59:02.090601: I tensorflow/core/common_runtime/gpu/gpu_device.cc:941] 0: N
2018-03-23 16:59:02.090801: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1052] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3631 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
INFO:tensorflow:Restoring parameters from /home/dennis/models/research/ssd_mobilenet_v1_coco_2017_11_17/model.ckpt
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path temp/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
At this point the system crashes, and I have to power down the system and then power it back up (reboot).
In addition to my training data and configuration, I also used the model and data from an article by Dat Tran called "How to train your object detector with Tensorflow's object detection API" and got the same results.
I have been able to run the mnist example and other tests that show that tensorflow-gpu is working.
I am not sure what to do next. Is there additional information I can gather to help further diagnose the problem?
Any advice would be greatly appreciated,
Thanks