How do I train a deep learning neural network that contains an embedding layer using a GPU? - tensorflow

I'm getting an InvalidArgumentError on my embedding layer:
Colocation Debug Info:
Colocation group had the following types and supported devices:
Root Member(assigned_device_name_index_=2 requested_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' assigned_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' resource_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
GatherV2: GPU CPU
Cast: GPU CPU
Const: GPU CPU
ResourceSparseApplyAdagradV2: CPU
_Arg: GPU CPU
ReadVariableOp: GPU CPU
Colocation members, user-requested devices, and framework assigned devices, if any:
model_6_user_embedding_embedding_lookup_readvariableop_resource (_Arg) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
adagrad_adagrad_update_1_update_0_resourcesparseapplyadagradv2_accum (_Arg) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
model_6/User-Embedding/embedding_lookup/ReadVariableOp (ReadVariableOp)
model_6/User-Embedding/embedding_lookup/axis (Const)
model_6/User-Embedding/embedding_lookup (GatherV2)
gradient_tape/model_6/User-Embedding/embedding_lookup/Shape (Const)
gradient_tape/model_6/User-Embedding/embedding_lookup/Cast (Cast)
Adagrad/Adagrad/update_1/update_0/ResourceSparseApplyAdagradV2 (ResourceSparseApplyAdagradV2) /job:localhost/replica:0/task:0/device:GPU:0
[[{{node model_6/User-Embedding/embedding_lookup/ReadVariableOp}}]] [Op:__inference_train_function_2997]
Link to Google Colab:
https://colab.research.google.com/drive/1ZN1HzSTTfvA_zstuI-EsKjw7Max1f73v?usp=sharing
It's a really simple neural network, and the data is available to download from Kaggle - you can just drag and drop it into Colab to get it working.
I've also tried setting soft device placement to True with
tf.config.set_soft_device_placement(True), but that doesn't seem to have worked.
From the error log, it looks like MirroredStrategy has assigned the embedding lookup operation to the GPU (which is GPU-incompatible, and I can see why). I was hoping that tf.config.set_soft_device_placement(True) would tell TensorFlow to fall back to the CPU instead, but it feels like that's being ignored.
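To illustrate what I mean by falling back to the CPU, here is a sketch (not from my notebook; the input and dimensions are made up) of explicitly pinning the embedding to the CPU - although under MirroredStrategy this placement may still be overridden:

import tensorflow as tf

user_id_input = tf.keras.Input(shape=(1,), dtype='int32', name='user_id')  # hypothetical input
with tf.device('/CPU:0'):
    # Illustrative only: request the embedding table and its lookup on the CPU.
    # (Deferred variable creation and MirroredStrategy may override this.)
    user_embedding = tf.keras.layers.Embedding(
        input_dim=10000, output_dim=32, name='User-Embedding')  # dims made up
    embedded = user_embedding(user_id_input)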
Has anyone seen this problem before, and does anyone know of a workaround?

Found a similar issue for TF1.14:
https://github.com/tensorflow/tensorflow/issues/31318
Looks like MirroredStrategy can't support training embedding layers using momentum-based optimisers.
Cloning the above notebook and using RMSprop (with momentum=0) seemed to work:
https://colab.research.google.com/drive/13MXa8Q96M6uzlkK3K_M7vmQfclL59eRj?usp=sharing
I'll use RMSProp with no momentum for now until this issue is fixed. The error message certainly hasn't helped!
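For reference, a minimal sketch of that workaround (the model and loss below are illustrative, not the ones from the notebook):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    # Tiny stand-in model: an embedding followed by a dense head (dims made up).
    inputs = tf.keras.Input(shape=(1,), dtype='int32')
    x = tf.keras.layers.Embedding(input_dim=10000, output_dim=32)(inputs)
    outputs = tf.keras.layers.Dense(1)(tf.keras.layers.Flatten()(x))
    model = tf.keras.Model(inputs, outputs)
    # RMSprop with momentum=0 sidesteps the momentum-style sparse update
    # that lacks a GPU kernel (per the linked TF issue above).
    model.compile(
        optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.001, momentum=0.0),
        loss='mse')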

Related

Using an NCHW-trained GAN on a PC with CPU only?

I am playing around with the Progressive Growing of GANs network from Karras et al. (NVIDIA). I trained the network on a different dataset on 2x 1080 Ti cards, using the NCHW mode for all convolutions, as seen here.
Now I have a trained model and I want to use the code snippet from the project's readme to import the trained network - which is mentioned in the Importing and using pre-trained networks section.
However, when I try to run it on a PC with a CPU only, the network fails with the error:
InvalidArgumentError (see above for traceback): Conv2DCustomBackpropInputOp only supports NHWC.
[[node Gs/Run/Gs/cond/8x8/Conv0_up/conv2d_transpose (defined at <string>:93) ]]
I know that the CPU does not have NCHW implemented, but I would like to know how I can work around this. I see several options, none of which I want:
The first thing that comes to mind is to just use the NHWC mode for generating a new image, even though the network was trained in NCHW. This, however, if I am correct, would mess up the layers' weights, since they were trained with NCHW.
I do not want to retrain the whole network in NHWC mode, since it should be slower than NCHW. I do not use NVIDIA's cuDNN - does that mean the two modes are the same speed then, according to this?
What else can I do? Thank you!
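Purely as an illustration (not from the original post): converting a tensor between the two layouts is just an axis transpose, which is what any NHWC fallback would have to do around each NCHW op:

import tensorflow as tf

x_nchw = tf.random.normal([1, 512, 8, 8])         # [batch, channels, height, width]
x_nhwc = tf.transpose(x_nchw, perm=[0, 2, 3, 1])  # -> [batch, height, width, channels]
x_back = tf.transpose(x_nhwc, perm=[0, 3, 1, 2])  # inverse permutation, back to NCHW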

Is it necessary to install GPU libraries on Google Colaboratory before using GPU?

I've been trying to use the GPU with TensorFlow on Colaboratory, but when I run
import numpy as np
import tensorflow as tf

a = tf.constant(np.random.rand(1000, 20000))
b = tf.constant(np.random.rand(20000, 1000))
# Pin the matmul to the GPU and ask TensorFlow to log where each op runs.
with tf.device('/device:GPU:0'):
    c_gpu = tf.matmul(a, b)
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
print(sess.run(c_gpu))
the devices of the operations are not printed, although the result of the operation is. I suspect it is not using the GPU, because I measured the time of the matrix multiplication on both GPU and CPU and compared them.
No, it is not necessary.
In Colaboratory you should check under Runtime -> Change runtime type that the Hardware accelerator parameter is set to GPU.
Then, to test whether TensorFlow uses it, you can see this interesting sample; it works for me:
https://stackoverflow.com/a/43703735/9250875
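A quick check along those lines (a sketch using the same TF 1.x style API as the question's code):

import tensorflow as tf
from tensorflow.python.client import device_lib

# Returns something like '/device:GPU:0' when a GPU runtime is attached, '' otherwise.
print(tf.test.gpu_device_name())

# Lists every device TensorFlow can see (the CPU and, if present, the GPU).
print(device_lib.list_local_devices())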

Is there any way to fuse a fully connected layer (GEMM) and an activation layer (ReLU/sigmoid) on the GPU in a DNN?

Usually one layer in a DNN consists of MatMul, BiasAdd, and Relu. cuBLAS provides a GEMM for the MatMul, and we can do BiasAdd and Relu in another kernel on the GPU. That makes two GPU launch calls - is there any way to fuse them all together into just one? I looked into cuBLAS and cuDNN but did not find anything. I think it should not be difficult, because BiasAdd and Relu are just element-wise operations, and fusing them would be more efficient.
Here is the background:
I am working on an online prediction service that ensembles multiple DNN models. By profiling my program, I found that neither my CPU nor my GPU is fully utilized, but requests block on GPU-related function calls (like launchKernel). It seems like there is a big lock in libcuda. I am using TensorFlow with XLA enabled, so I used nvprof and the TensorFlow HLO output to visualize the GPU calls, and there are only dot and fused (i.e. BiasAdd and Relu) operations. Although kernel fusion is done, there are still too many launchKernel calls, and GPU utilization is only around 60%. I tried multiple CUDA contexts in one process, but the improvement was trivial.
By the way, I am using a single GPU, a Tesla P100.
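For context, here is a minimal sketch of the per-layer pattern being discussed, written against the TF 2.x API (an assumption - the setup above is presumably TF 1.x with XLA). With jit_compile=True, XLA typically fuses the BiasAdd and Relu into one element-wise kernel while the MatMul stays a separate cuBLAS GEMM call, matching the dot + fused breakdown described above:

import tensorflow as tf

@tf.function(jit_compile=True)  # ask XLA to compile and fuse this function
def dense_relu(x, w, b):
    # MatMul -> BiasAdd -> Relu: the layer pattern from the question.
    return tf.nn.relu(tf.matmul(x, w) + b)

x = tf.random.normal([32, 1024])
w = tf.random.normal([1024, 1024])
b = tf.zeros([1024])
y = dense_relu(x, w, b)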

Tensorflow behaves differently between GPU and CPU

I modified the TensorFlow tutorial example for beginners by adding a hidden layer. The recognition rate obtained by running on the GPU is ~95%, but when running in CPU-only mode I get ~40%. Does anybody know why the same Python code behaves so differently? Thanks.
Weiguang

no supported kernel for GPU devices is available for SparseTensorDenseMatMul_grad

I ran into an issue when building a model with the tf.sparse_tensor_dense_matmul op in my graph. Part of the error info is pasted below.
Does that mean there is no GPU kernel available to compute the gradient of SparseTensorDenseMatMul ("SparseTensorDenseMatMul_grad")? I can build the model successfully with allow_soft_placement=True in the session config. However, I need all of the computation to stay on the GPU for a specific reason. Does anyone know how to fix this issue, or do I need to implement the CUDA kernel for this op myself? Thanks a lot.
tensorflow.python.framework.errors.InvalidArgumentError: Cannot assign a device to node 'gradients/softmax_linear/SparseTensorDenseMatMul/SparseTensorDenseMatMul_grad/Slice_1': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
[[Node: gradients/softmax_linear/SparseTensorDenseMatMul/SparseTensorDenseMatMul_grad/Slice_1 = Slice[Index=DT_INT32, T=DT_INT64, _device="/device:GPU:0"](Placeholder_2, gradients/softmax_linear/SparseTensorDenseMatMul/SparseTensorDenseMatMul_grad/Slice_1/begin, gradients/softmax_linear/SparseTensorDenseMatMul/SparseTensorDenseMatMul_grad/Slice_1/size)]]
Caused by op u'gradients/softmax_linear/SparseTensorDenseMatMul/SparseTensorDenseMatMul_grad/Slice_1', defined at:
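For reference, the soft-placement workaround mentioned above looks roughly like this in the TF 1.x API (a sketch only):

import tensorflow as tf

# allow_soft_placement lets ops with no GPU kernel (like the Slice in the trace
# above) fall back to the CPU instead of raising the device-assignment error.
config = tf.ConfigProto(allow_soft_placement=True,
                        log_device_placement=True)  # also log where each op runs
sess = tf.Session(config=config)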