no supported kernel for GPU devices is available for SparseTensorDenseMatMul_grad - tensorflow

I hit an issue when building a model that uses the tf.sparse_tensor_dense_matmul op in my graph. Part of the error info is pasted below.
Does that mean there is no GPU kernel to compute the gradient of "SparseTensorDenseMatMul_grad"? I can build the model successfully with "allow_soft_placement=True" in the session config. However, I need all of the computation to stay on the GPU for a specific reason. Does anyone know how to fix this issue, or do I need to implement the CUDA kernel for this op myself? Thanks a lot.
tensorflow.python.framework.errors.InvalidArgumentError: Cannot assign a device to node 'gradients/softmax_linear/SparseTensorDenseMatMul/SparseTensorDenseMatMul_grad/Slice_1': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
[[Node: gradients/softmax_linear/SparseTensorDenseMatMul/SparseTensorDenseMatMul_grad/Slice_1 = Slice[Index=DT_INT32, T=DT_INT64, _device="/device:GPU:0"](Placeholder_2, gradients/softmax_linear/SparseTensorDenseMatMul/SparseTensorDenseMatMul_grad/Slice_1/begin, gradients/softmax_linear/SparseTensorDenseMatMul/SparseTensorDenseMatMul_grad/Slice_1/size)]]
Caused by op u'gradients/softmax_linear/SparseTensorDenseMatMul/SparseTensorDenseMatMul_grad/Slice_1', defined at:
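For reference, the soft-placement configuration mentioned above looks roughly like this (a minimal TF1 Session sketch; log_device_placement additionally prints which ops actually fell back to the CPU):

import tensorflow as tf

# Ops without a GPU kernel (such as the int64 Slice inside SparseTensorDenseMatMul_grad)
# fall back to the CPU instead of raising InvalidArgumentError.
config = tf.ConfigProto(allow_soft_placement=True,
                        log_device_placement=True)  # logs where each op actually runs
sess = tf.Session(config=config)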

Related

How do I train deep learning neural network that contains embedding layer using GPU?

I'm getting an InvalidArgumentError on my embedding layer:
Colocation Debug Info:
Colocation group had the following types and supported devices:
Root Member(assigned_device_name_index_=2 requested_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' assigned_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' resource_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
GatherV2: GPU CPU
Cast: GPU CPU
Const: GPU CPU
ResourceSparseApplyAdagradV2: CPU
_Arg: GPU CPU
ReadVariableOp: GPU CPU
Colocation members, user-requested devices, and framework assigned devices, if any:
model_6_user_embedding_embedding_lookup_readvariableop_resource (_Arg) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
adagrad_adagrad_update_1_update_0_resourcesparseapplyadagradv2_accum (_Arg) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
model_6/User-Embedding/embedding_lookup/ReadVariableOp (ReadVariableOp)
model_6/User-Embedding/embedding_lookup/axis (Const)
model_6/User-Embedding/embedding_lookup (GatherV2)
gradient_tape/model_6/User-Embedding/embedding_lookup/Shape (Const)
gradient_tape/model_6/User-Embedding/embedding_lookup/Cast (Cast)
Adagrad/Adagrad/update_1/update_0/ResourceSparseApplyAdagradV2 (ResourceSparseApplyAdagradV2) /job:localhost/replica:0/task:0/device:GPU:0
[[{{node model_6/User-Embedding/embedding_lookup/ReadVariableOp}}]] [Op:__inference_train_function_2997]
Link to google colab:
https://colab.research.google.com/drive/1ZN1HzSTTfvA_zstuI-EsKjw7Max1f73v?usp=sharing
It's a really simple neural network, and the data is available to download from Kaggle - you could just drag and drop it into Colab to get it working.
I've also tried setting soft device placement with tf.config.set_soft_device_placement(True), but that doesn't seem to have worked.
From the error log, it looks like MirroredStrategy has assigned the embedding lookup operation to the GPU (which it is incompatible with, and I can see why), and I was hoping that tf.config.set_soft_device_placement(True) would tell TensorFlow to use the CPU instead, but it feels like that's being ignored.
Has anyone seen this problem before and know of a workaround?
Found a similar issue for TF1.14:
https://github.com/tensorflow/tensorflow/issues/31318
Looks like MirroredStrategy can't support training embedding layers using momentum-based optimisers.
Cloning the above notebook and using RMSprop (with momentum=0) seemed to work:
https://colab.research.google.com/drive/13MXa8Q96M6uzlkK3K_M7vmQfclL59eRj?usp=sharing
I'll use RMSProp with no momentum for now until this issue is fixed. The error message certainly hasn't helped!
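For reference, a minimal sketch of that workaround, assuming a Keras model compiled under MirroredStrategy (build_model and the hyperparameters are placeholders, not taken from the notebook):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = build_model()  # placeholder: your model containing the embedding layer
    # The workaround from the linked notebook: RMSprop with momentum set to 0,
    # instead of Adagrad, whose sparse update op only has a CPU kernel per the error log.
    model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.001, momentum=0.0),
                  loss="mse")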

How to disable GPUs in H2O AutoML

When I run an experiment with H2O AutoML, I get the error: "terminate called after throwing an instance of 'thrust::system::system_error' what(): parallel_for failed: invalid resource handle". This error message comes from XGBoost and is caused by exceeding the GPU limit.
When I use regular XGBoost, I set the CUDA visible devices variable to blank to disable GPUs. However, this setting seems to be ignored by the XGBoost implementation in H2O AutoML.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""
Currently, XGBoost is the only algorithm in H2O AutoML that runs on the GPU.
The question is: does anybody know how to disable GPUs in H2O AutoML?
As a workaround, I excluded the XGBoost algorithm to run my experiment for now. The trouble goes away when I exclude XGBoost, but I do not want to give up the power of XGBoost.
from h2o.automl import H2OAutoML
model = H2OAutoML(max_runtime_secs = 60*60*2, exclude_algos = ["XGBoost"])
That's definitely an oversight and we will need to add the ability to turn on/off and/or specify the GPU. I opened a ticket for this. I wonder if there's a way to temporarily disable the GPU at the system level (outside of H2O/Python) in the meantime? Thanks for the report!
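In the meantime, a hedged sketch of what "disabling the GPU at the system level" might look like from Python: set the environment variable before the H2O cluster (JVM) starts, rather than after. Whether H2O's XGBoost backend honors it has not been confirmed.

import os

# Assumption: hiding all GPUs before the H2O JVM is launched may force XGBoost onto the CPU.
# Setting this after h2o.init() (or for an already-running cluster) would have no effect.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

import h2o
from h2o.automl import H2OAutoML

h2o.init()
aml = H2OAutoML(max_runtime_secs=60 * 60 * 2)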

Using dedicated graphics card only for tensorflow

Is it possible to have Tensorflow utilize the dedicated graphics card for training models while the integrated graphics card handles the non-ML tasks?
It is basically possible, assuming compatible hardware. Right now, the supported hardware is mainstream CPUs, NVIDIA GPUs, and Google's TPUs.
Choosing where to compute is called device pinning or placement. You can see how to actually do it with the current API under the “Placing operations on different devices” section of the current documentation.
Stolen from the above link:
# Operations created outside either context will run on the "best possible"
# device. For example, if you have a GPU and a CPU available, and the operation
# has a GPU implementation, TensorFlow will choose the GPU.
weights = tf.random_normal(...)

with tf.device("/device:CPU:0"):
    # Operations created in this context will be pinned to the CPU.
    img = tf.decode_jpeg(tf.read_file("img.jpg"))

with tf.device("/device:GPU:0"):
    # Operations created in this context will be pinned to the GPU.
    result = tf.matmul(weights, img)
You mention the integrated graphics card. In theory it is possible to use one, but is it supported? It may be one day, with the new XLA architecture of TensorFlow (still alpha at this stage).
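To see which devices TensorFlow can actually pin operations to on your machine, you can list them; a small sketch using the same TF1-era API as the snippet above:

from tensorflow.python.client import device_lib

# Prints one entry per visible device, e.g. /device:CPU:0 and /device:GPU:0.
for device in device_lib.list_local_devices():
    print(device.name, device.device_type)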

How can I know which operation can not be placed on GPU in tensorflow?

How can I know which operations cannot be placed on the GPU in TensorFlow? Is there a place I can check?
Thanks
You can check the kernels (i.e. device-specific implementations) for ops, which are located in this directory: https://github.com/tensorflow/tensorflow/tree/r0.11/tensorflow/core/kernels/
For example, suppose you would like to know whether softmax can be placed on GPU. You can navigate to the kernel of softmax: https://github.com/tensorflow/tensorflow/blob/r0.11/tensorflow/core/kernels/softmax_op.cc. You will find the following code:
REGISTER_KERNEL_BUILDER(
    Name("Softmax").Device(DEVICE_GPU).TypeConstraint<Eigen::half>("T"),
    SoftmaxOp<GPUDevice, Eigen::half>);
This means there is a GPU kernel for softmax with type float16 (Eigen::half). The prerequisite is that you build your TensorFlow with GPU support enabled.
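Besides reading the kernel registrations, you can also ask TensorFlow to log where each op actually ends up at runtime; a short sketch with the same r0.x Session API:

import tensorflow as tf

# Placement of every op is logged when the graph runs; ops without a GPU
# kernel will show up assigned to /cpu:0.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))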

Configuring TensorFlow to use all CPUs

Reading https://www.tensorflow.org/versions/r0.10/resources/faq.html, it states:
Does TensorFlow make use of all the devices (GPUs and CPUs) available
on my machine?
TensorFlow supports multiple GPUs and CPUs. See the how-to
documentation on using GPUs with TensorFlow for details of how
TensorFlow assigns operations to devices, and the CIFAR-10 tutorial
for an example model that uses multiple GPUs.
Note that TensorFlow only uses GPU devices with a compute capability
greater than 3.5.
Does this mean TensorFlow can automatically make use of all CPUs on a given machine, or does it need to be explicitly configured?
CPUs are used via a "device" which is just a threadpool. You can control the number of threads if you feel like you need more:
sess = tf.Session(config=tf.ConfigProto(
    intra_op_parallelism_threads=NUM_THREADS))
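A slightly fuller sketch: ConfigProto also exposes inter_op_parallelism_threads, the pool used to run independent ops concurrently (NUM_THREADS is a placeholder you choose for your machine):

import tensorflow as tf

NUM_THREADS = 8  # placeholder value

sess = tf.Session(config=tf.ConfigProto(
    intra_op_parallelism_threads=NUM_THREADS,   # threads used inside a single op (e.g. matmul)
    inter_op_parallelism_threads=NUM_THREADS))  # threads used to run independent ops in parallel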