Look at the code:
import tensorflow as tf

with tf.device('/gpu:0'):
    with tf.device('/cpu:0'):
        x = tf.constant(0, name='x')
        x = x * 2
    y = x + 2

config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    sess.run(y)
When you run it, you get the following output:
mul: (Mul): /job:localhost/replica:0/task:0/cpu:0
2017-08-11 21:38:23.953846: I c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\common_runtime\simple_placer.cc:847] mul: (Mul)/job:localhost/replica:0/task:0/cpu:0
add: (Add): /job:localhost/replica:0/task:0/gpu:0
2017-08-11 21:38:23.954846: I c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\common_runtime\simple_placer.cc:847] add: (Add)/job:localhost/replica:0/task:0/gpu:0
add/y: (Const): /job:localhost/replica:0/task:0/gpu:0
2017-08-11 21:38:23.954846: I c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\common_runtime\simple_placer.cc:847] add/y: (Const)/job:localhost/replica:0/task:0/gpu:0
mul/y: (Const): /job:localhost/replica:0/task:0/cpu:0
2017-08-11 21:38:23.954846: I c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\common_runtime\simple_placer.cc:847] mul/y: (Const)/job:localhost/replica:0/task:0/cpu:0
x: (Const): /job:localhost/replica:0/task:0/cpu:0
2017-08-11 21:38:23.954846: I c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\common_runtime\simple_placer.cc:847] x: (Const)/job:localhost/replica:0/task:0/cpu:0
This means that mul runs on the CPU and add runs on the GPU. So I drew the conclusion that the device scope in which an op or tensor is defined determines the device it runs on.
But when I looked at the Inception multi-GPU code, I got confused.
with slim.arg_scope([slim.variables.variable], device='/cpu:0'):
    # Calculate the loss for one tower of the ImageNet model. This
    # function constructs the entire ImageNet model but shares the
    # variables across all towers.
    loss = _tower_loss(images_splits[i], labels_splits[i], num_classes,
                       scope, reuse_variables)
_tower_loss is defined under a CPU device scope, which, by my conclusion, means every GPU tower would compute its loss on the CPU. But I think every replica should run on its own GPU. Am I misunderstanding something?
Parent device assignments are overridden by child device assignments.
In the following code, the function _tower_loss makes another device assignment, /gpu:i, inside its implementation (if you look at the code). The losses are computed on the GPUs, but they are gathered and averaged on the CPU.
loss = _tower_loss(images_splits[i], labels_splits[i], num_classes,
                   scope, reuse_variables)
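To make the override rule concrete, here is a minimal sketch (assuming at least one GPU is visible); the placement log should show the constant on cpu:0 and the multiplication on gpu:0:

import tensorflow as tf

with tf.device('/cpu:0'):
    a = tf.constant(1.0, name='a')  # outer scope: placed on cpu:0
    with tf.device('/gpu:0'):
        b = a * 2.0                 # inner scope overrides the outer: gpu:0

with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    sess.run(b)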
Related
I am running Windows 10 with a Core i7-8700 CPU and a GeForce GTX 1660 Ti GPU.
When training models, GPU utilization is very low (5-10% at max, sometimes lower), even when the network is only five layers deep. CPU utilization, on the other hand, is 30% and above.
Please check the following:
1. That the CUDA and cuDNN versions match your TensorFlow build. Given those utilization statistics, training may very well be running on the CPU instead of the GPU. You can verify that your GPU is available and used with the code below.
2. If the first point is resolved, you may want to increase the batch_size in case it is very small; a tiny batch can leave the GPU mostly idle. See the sketch after the log output below.
For step 1, in order to verify that the video card is both available and used, run the following lines of code:
import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
tf.debugging.set_log_device_placement(True)
# Create some tensors
a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = tf.matmul(a, b)
print(c)
The output should contain (along with the result) the following information:
Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0
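For step 2, a minimal sketch of raising the batch size; model, x_train and y_train are hypothetical names standing in for your compiled Keras model and training data:

# Hypothetical names: model, x_train and y_train are assumed to exist.
# A larger batch gives the GPU more work per step, which usually raises
# GPU utilization; try a few powers of two and watch the metrics.
model.fit(x_train, y_train, epochs=10, batch_size=256)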
Running AWS SageMaker with a custom model, the TrainingJob fails with an Algorithm Error when using Keras with a TensorFlow backend in a multi-GPU configuration:
from keras.utils import multi_gpu_model

parallel_model = multi_gpu_model(model, gpus=K)
parallel_model.compile(loss='categorical_crossentropy',
                       optimizer='rmsprop')
parallel_model.fit(x, y, epochs=20, batch_size=256)
This simple parallel model loading fails. There is no further error or exception in the CloudWatch logs. The same configuration works properly on a local machine with 2x NVIDIA GTX 1080 and the same Keras/TensorFlow backend.
According to the SageMaker documentation and tutorials, the multi_gpu_model utility works when the Keras backend is MXNet, but I found no mention of the same multi-GPU configuration with a TensorFlow backend.
[UPDATE]
I have updated the code with the suggested answer below, and I'm adding some logging captured before the TrainingJob hangs.
The following logging repeats twice:
2018-11-27 10:02:49.878414: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2, 3
2018-11-27 10:02:49.878462: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-27 10:02:49.878471: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1 2 3
2018-11-27 10:02:49.878477: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N Y Y Y
2018-11-27 10:02:49.878481: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: Y N Y Y
2018-11-27 10:02:49.878486: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2: Y Y N Y
2018-11-27 10:02:49.878492: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 3: Y Y Y N
2018-11-27 10:02:49.879340: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:0 with 14874 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1b.0, compute capability: 7.0)
2018-11-27 10:02:49.879486: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:1 with 14874 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1c.0, compute capability: 7.0)
2018-11-27 10:02:49.879694: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:2 with 14874 MB memory) -> physical GPU (device: 2, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1d.0, compute capability: 7.0)
2018-11-27 10:02:49.879872: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:3 with 14874 MB memory) -> physical GPU (device: 3, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)
Before that, there is some logging info about each GPU, repeated four times:
2018-11-27 10:02:46.447639: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 3 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:1e.0
totalMemory: 15.78GiB freeMemory: 15.37GiB
According to the logging, all four GPUs are visible and loaded by the TensorFlow Keras backend. After that no application logging follows; the TrainingJob status stays InProgress for a while and then becomes Failed with the same Algorithm Error.
Looking at the CloudWatch metrics I can see some activity: GPU memory utilization and CPU utilization look fine, while GPU utilization is 0%.
[UPDATE]
Due to a known Keras bug with saving a multi-GPU model, I'm using this override of the multi_gpu_model utility from keras.utils:
from keras.layers import Lambda, concatenate
from keras import Model
import tensorflow as tf

def multi_gpu_model(model, gpus):
    # source: https://github.com/keras-team/keras/issues/8123#issuecomment-354857044
    if isinstance(gpus, (list, tuple)):
        num_gpus = len(gpus)
        target_gpu_ids = gpus
    else:
        num_gpus = gpus
        target_gpu_ids = range(num_gpus)

    def get_slice(data, i, parts):
        shape = tf.shape(data)
        batch_size = shape[:1]
        input_shape = shape[1:]
        step = batch_size // parts
        if i == num_gpus - 1:
            size = batch_size - step * i
        else:
            size = step
        size = tf.concat([size, input_shape], axis=0)
        stride = tf.concat([step, input_shape * 0], axis=0)
        start = stride * i
        return tf.slice(data, start, size)

    all_outputs = []
    for i in range(len(model.outputs)):
        all_outputs.append([])

    # Place a copy of the model on each GPU,
    # each getting a slice of the inputs.
    for i, gpu_id in enumerate(target_gpu_ids):
        with tf.device('/gpu:%d' % gpu_id):
            with tf.name_scope('replica_%d' % gpu_id):
                inputs = []
                # Retrieve a slice of the input.
                for x in model.inputs:
                    input_shape = tuple(x.get_shape().as_list())[1:]
                    slice_i = Lambda(get_slice,
                                     output_shape=input_shape,
                                     arguments={'i': i,
                                                'parts': num_gpus})(x)
                    inputs.append(slice_i)

                # Apply model on slice
                # (creating a model replica on the target device).
                outputs = model(inputs)
                if not isinstance(outputs, list):
                    outputs = [outputs]

                # Save the outputs for merging back together later.
                for o in range(len(outputs)):
                    all_outputs[o].append(outputs[o])

    # Merge outputs on CPU.
    with tf.device('/cpu:0'):
        merged = []
        for name, outputs in zip(model.output_names, all_outputs):
            merged.append(concatenate(outputs,
                                      axis=0, name=name))
        return Model(model.inputs, merged)
This works on a local machine with 2x NVIDIA GTX 1080 / Intel Xeon / Ubuntu 16.04, but fails in a SageMaker TrainingJob.
I have posted this issue on the AWS SageMaker forums:
- TrainingJob custom algorithm with Keras backend and multi GPU
- SageMaker Fails when using Multi-GPU with keras.utils.multi_gpu_model
[UPDATE]
I have slightly modified the tf.Session code, adding some initializers:
with tf.Session() as session:
    K.set_session(session)
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
and now, according to the instance metrics, at least one GPU (I assume gpu:0) is used. Multi-GPU still does not work, though.
This might not be the best answer for your problem, but this is what I am using for a multi-GPU model with the TensorFlow backend. First I initialize using:
def setup_multi_gpus():
    """
    Setup multi GPU usage

    Example usage:
    model = Sequential()
    ...
    multi_model = multi_gpu_model(model, gpus=num_gpu)
    multi_model.fit()

    About memory usage:
    https://stackoverflow.com/questions/34199233/how-to-prevent-tensorflow-from-allocating-the-totality-of-a-gpu-memory
    """
    import tensorflow as tf
    from keras.utils.training_utils import multi_gpu_model
    from tensorflow.python.client import device_lib

    # IMPORTANT: Tells tf to not occupy a specific amount of memory
    from keras.backend.tensorflow_backend import set_session
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True  # dynamically grow the memory used on the GPU
    sess = tf.Session(config=config)
    set_session(sess)  # set this TensorFlow session as the default session for Keras

    # getting the number of GPUs
    def get_available_gpus():
        local_device_protos = device_lib.list_local_devices()
        return [x.name for x in local_device_protos if x.device_type == 'GPU']

    num_gpu = len(get_available_gpus())
    print('Amount of GPUs available: %s' % num_gpu)
    return num_gpu
Then I call:
# Setup multi GPU usage
num_gpu = setup_multi_gpus()
and create a model.
...
After that, you're able to make it a multi-GPU model:
multi_model = multi_gpu_model(model, gpus=num_gpu)
multi_model.compile...
multi_model.fit...
The only thing here that differs from what you are doing is the way TensorFlow initializes the GPUs. I can't imagine it being the problem, but it might be worth trying out.
Good luck!
Edit: I noticed that sequence-to-sequence models are not able to work with multi-GPU. Is that the type of model you are trying to train?
I apologize for the slow response.
It seems there are a lot of threads running in parallel, and I want to link them together so that others with the same issue can follow the progress and discussion:
https://forums.aws.amazon.com/thread.jspa?messageID=881541
https://forums.aws.amazon.com/thread.jspa?messageID=881540
https://github.com/aws/sagemaker-python-sdk/issues/512
There are a few questions in regard to this:
What version of TensorFlow and Keras are you using?
I am not too sure what is causing this problem. Does your container have all of the needed dependencies, such as CUDA? https://www.tensorflow.org/install/gpu
Were you able to train using a single GPU with Keras?
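A quick, hedged way to answer the first and third points from inside the container is to print the versions and check GPU availability (TF1-era API, matching the code in this thread):

import tensorflow as tf
import keras

print('TensorFlow:', tf.__version__)
print('Keras:', keras.__version__)
# True only if a CUDA-enabled GPU is visible and usable by TensorFlow.
print('GPU available:', tf.test.is_gpu_available())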
My code:
import tensorflow as tf

def main():
    with tf.device('/gpu:0'):
        a = tf.Variable(1)
        init_a = tf.global_variables_initializer()
    with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
        sess.run(init_a)

if __name__ == '__main__':
    main()
The error:
InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'Variable': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
Does this mean TF can't pin a Variable to the GPU?
Here is another thread related to this topic.
int32 types are not (as of January 2018) comprehensively supported on GPUs. I believe the full error would say something like:
InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'Variable': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
Colocation Debug Info:
Colocation group had the following types and devices:
Assign: CPU
Identity: CPU
VariableV2: CPU
[[Node: Variable = VariableV2[container="", dtype=DT_INT32, shape=[], shared_name="", _device="/device:GPU:0"]()]]
And it's the DT_INT32 there that is causing you trouble, since you explicitly requested that the variable be placed on GPU but there is no GPU kernel for the corresponding operation and dtype.
If this was just a test program and in reality you need variables of another type, such as float32, you should be fine. For example:
import tensorflow as tf

with tf.device('/gpu:0'):
    # Providing 1. instead of 1 as the initial value will result
    # in a float32 variable. Alternatively, you could explicitly
    # provide the dtype argument to tf.Variable()
    a = tf.Variable(1.)
    init_a = tf.global_variables_initializer()

with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    sess.run(init_a)
Alternatively, you could choose to explicitly place int32 variables on CPU, or just not specify any device at all and let TensorFlow's device placement select GPU where appropriate. For example:
import tensorflow as tf
v_int = tf.Variable(1, name='intvar')
v_float = tf.Variable(1., name='floatvar')
init = tf.global_variables_initializer()
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
sess.run(init)
This will show that 'intvar' is placed on the CPU while 'floatvar' is on the GPU, with log lines like:
floatvar: (VariableV2)/job:localhost/replica:0/task:0/device:GPU:0
intvar: (VariableV2)/job:localhost/replica:0/task:0/device:CPU:0
Hope that helps.
This means that TensorFlow cannot find the device you specified.
I assume you wanted to specify that your code should be executed on your GPU 0.
The correct syntax would be:
with tf.device('/device:GPU:0'):
The short form you are using is only allowed for the CPU.
You can also check this answer here: How to get current available GPUs in tensorflow?
It shows how to list the GPU devices that are recognized by TF.
And this lists the syntax: https://www.tensorflow.org/tutorials/using_gpu
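For completeness, a minimal sketch of that device listing, using device_lib as in the linked answer:

from tensorflow.python.client import device_lib

# Prints every GPU device TensorFlow recognizes, e.g. '/device:GPU:0'.
gpus = [d.name for d in device_lib.list_local_devices()
        if d.device_type == 'GPU']
print(gpus)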
I have something equivalent to a sparse softmax:
...
with tf.device('/gpu:0'):
    indices = tf.placeholder(tf.int32, [None, dimsize])
    self._W = weight_variable([self._num_nodes, input_layer_size])
    self._b = bias_variable([self._num_nodes])
    sampled_W = tf.transpose(tf.nn.embedding_lookup(self._W, indices), [0, 2, 1])  # [batchsize, inputlayersize, dim1size]
    sampled_b = tf.nn.embedding_lookup(self._b, indices)  # [batchsize, dim1size]
...
However, when I enable placement logging, I see multiple instances of the gradients being placed on the CPU, e.g.:
gradients/.../embedding_lookup_1_grad/Size: /job:localhost/replica:0/task:0/cpu:0
I tensorflow/core/common_runtime/simple_placer.cc:819] gradients/.../embedding_lookup_1_grad/Size: /job:localhost/replica:0/task:0/cpu:0
This happens no matter which optimizer I choose. Am I missing something here?
If you use
tf.Session(config=tf.ConfigProto(allow_soft_placement=False))
you should get an error. That's because embedding_lookup isn't implemented on the GPU currently.
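Conversely, if you want ops without a GPU kernel to fall back to the CPU silently, here is a minimal sketch (TF1 API, as in the question); the placement log shows where each op actually landed:

import tensorflow as tf

with tf.device('/gpu:0'):
    params = tf.Variable(tf.random_normal([100, 8]))
    ids = tf.constant([3, 7, 42])
    looked_up = tf.nn.embedding_lookup(params, ids)

# allow_soft_placement=True lets ops without a GPU kernel run on the CPU
# instead of raising InvalidArgumentError.
config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(looked_up)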
I installed TensorFlow with GPU support, CUDA 7.0 and cuDNN 6.5. Importing TensorFlow works fine.
I am trying to run a simple matrix multiplication in TensorFlow and it doesn't want to use my GPU, even though it seems to recognize it. I have this issue both on my computer with an NVIDIA GeForce 970M and on a cluster with two Titan Z cards.
My first code is:
import tensorflow as tf
import numpy as np
size = 100

# I create 2 matrices
mat1 = np.random.random_sample([size, size])*100
mat2 = np.random.random_sample([size, size])*100
a = tf.constant(mat1)
b = tf.constant(mat2)
c = tf.matmul(a, b)
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
sess.run(c)
This code works and the result is:
Const_1: /job:localhost/replica:0/task:0/gpu:0
I tensorflow/core/common_runtime/simple_placer.cc:289] Const_1: /job:localhost/replica:0/task:0/gpu:0
Const: /job:localhost/replica:0/task:0/gpu:0
I tensorflow/core/common_runtime/simple_placer.cc:289] Const: /job:localhost/replica:0/task:0/gpu:0
MatMul: /job:localhost/replica:0/task:0/cpu:0
I tensorflow/core/common_runtime/simple_placer.cc:289] MatMul: /job:localhost/replica:0/task:0/cpu:0
So, as far as I can tell, TensorFlow uses my GPU to create the constants but not for matmul (which is weird). Then I force the GPU like this:
with tf.device("/gpu:0"):
a = tf.constant(mat1)
b = tf.constant(mat2)
c = tf.matmul(a, b)
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
sess.run(c)
And TensorFlow returns:
InvalidArgumentError: Cannot assign a device to node 'MatMul': Could not satisfy explicit device specification '/gpu:0'
If someone has the same problem or an idea, I will be glad to read your answer!
I do not have enough reputation to comment. I have come across a similar issue; my question is here:
TensorFlow: critical graph operations assigned to cpu rather than gpu