I am having problems executing a simple TensorFlow model that worked well yesterday. I suspect the whole problem relates to this error:
Blas GEMM launch failed
In the console it says,
tensorflow/core/common_runtime/gpu/gpu_util.cc:343] CPU->GPU Memcpy failed
My impression, based on the question "TensorFlow: Blas GEMM launch failed", is that this may relate to my CUDA installation. However, I can't see how to run the simpleCUBLAS example. I am completely new to CUDA.
I have four 1080 Ti GPUs (Ubuntu 16.04, TensorFlow 1.3.0) and I have not identified any zombie processes taking up GPU memory. Any help is greatly appreciated.
So I found the answer after days of going mad. I first ran this:
cd /usr/local/cuda/samples/7_CUDALibraries/simpleCUBLAS
make
./simpleCUBLAS
to check my CUBLAS installation. It returned CUBLAS INITIALIZATION FAILED!!!
So next I did this (based on advice)
sudo rm -rf ~/.nv
And it worked. Hope this saves someone else. Seems easy when you see it.
The other thing that is worth mentioning is that this problem also threw this error occasionally:
tensorflow/stream_executor/cuda/cuda_dnn.cc:385] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
tensorflow/stream_executor/cuda/cuda_dnn.cc:352] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
tensorflow/core/kernels/conv_ops.cc:605] Check failed: stream->parent()->GetConvolveAlgorithms(&algorithms)
This was cryptic - everybody suggested it was a memory issue and, sure enough, my GPUs got hogged by python during the initialization of my TF model. But it was the CUBLAS error that led me to the solution.
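Before deleting the cache directory mentioned above, it can be useful to see whether it exists and how much it holds. The following is only a sketch: the helper name cache_summary is made up for illustration, and it assumes the cache lives at ~/.nv as in the command above.

```python
import os


def cache_summary(path="~/.nv"):
    """Return (exists, file_count, total_bytes) for a cache directory.

    The default path is the ~/.nv directory from the answer above;
    the function only reads, it never deletes anything.
    """
    root = os.path.expanduser(path)
    if not os.path.isdir(root):
        return (False, 0, 0)
    file_count = 0
    total_bytes = 0
    # Walk the whole tree and add up the size of every regular file.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.isfile(full):
                file_count += 1
                total_bytes += os.path.getsize(full)
    return (True, file_count, total_bytes)


if __name__ == "__main__":
    print(cache_summary())
```

If the first element is False, there is no cache to remove and the cause is probably elsewhere.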
As the title says. I installed the CUDA toolkit and did the conda-forge install of tensorflow-gpu, and it works perfectly in my Jupyter notebook. I am running a CNN and it takes about 10 seconds per epoch. When I do the same thing in my PyCharm IDE, I get an error:
File "C:\Users\StackoverflowUser\Anaconda3\envs\Working_Gpu_Environment\lib\site-packages\tensorflow_core\python\client\session.py", line 1472, in __call__
run_metadata_ptr)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node conv2d_1/convolution}}]]
[[metrics/acc/Mean/_127]]
(1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node conv2d_1/convolution}}]]
0 successful operations.
0 derived errors ignored.
I think it has something to do with the environment I am using, especially because the last file in the traceback is in the path where my virtual environment is stored. However, my lack of expertise is holding me back. I tried to find out which default environment my notebook uses but could not figure out how. Are there any other ways?
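One way to compare the two environments is to run the same small snippet in a notebook cell and in PyCharm and compare the output. This is a sketch, not a definitive diagnosis: the function name describe_interpreter is made up here, and CONDA_DEFAULT_ENV is only set when a conda environment has been activated, so it may be None in either place.

```python
import os
import sys


def describe_interpreter():
    """Collect the facts that identify which Python environment is running."""
    return {
        # Full path of the python binary executing this code.
        "executable": sys.executable,
        # Root directory of the environment that binary belongs to.
        "prefix": sys.prefix,
        # Name of the activated conda environment, if any.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


if __name__ == "__main__":
    for key, value in describe_interpreter().items():
        print(f"{key}: {value}")
```

If the executable paths differ, the notebook and PyCharm are using different interpreters, and the one that works can be selected as the project interpreter in PyCharm.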
I've downloaded the newest Nsight Compute profiling tool and I want to use it to benchmark TensorFlow applications. The code I'm using is here. It runs perfectly fine when I execute it directly, and benchmarking it with nvprof ./mnist.py works without any problem. However, when I try to run it with the command sudo ./nv-nsight-cu-cli [path to the file] I get the following error:
ImportError: libcublas.so.9.0: cannot open shared object file: No such file or directory
I suspect that nv-nsight-cu-cli somehow didn't recognize the environment variable at all. Is there a fix for this?
You need to search for differences in both environments:
env variables
LD_LIBRARY_PATH
/etc/ld.so.conf
/etc/ld.so.conf.d/*
cuBLAS
Is installation complete/not broken?
Is it installed at the same location on both machines?
Versions
...
You can start with locate libcublas.so on both machines to see if there's a difference. Alternatively, you can strace -f -e open the program to check where it tries to load libcublas.so from.
Your error has (for now) nothing to do with GPUs: libcublas.so.9.0 simply cannot be found. Find it, find out why TensorFlow cannot find it, and your problem will be solved.
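The LD_LIBRARY_PATH part of that checklist can be sketched in a few lines of Python. The helper name find_library and its extra_dirs parameter are assumptions made for illustration. It only looks at LD_LIBRARY_PATH and any directories you pass in, not at /etc/ld.so.conf, so an empty result does not prove the library is missing from the system.

```python
import glob
import os


def find_library(pattern="libcublas.so*", extra_dirs=()):
    """Return sorted paths matching `pattern` in LD_LIBRARY_PATH and extra_dirs."""
    # Split LD_LIBRARY_PATH into directories, dropping empty entries.
    raw = os.environ.get("LD_LIBRARY_PATH", "")
    dirs = [d for d in raw.split(os.pathsep) if d]
    dirs.extend(extra_dirs)
    matches = []
    for directory in dirs:
        matches.extend(glob.glob(os.path.join(directory, pattern)))
    # De-duplicate in case a directory is listed twice.
    return sorted(set(matches))


if __name__ == "__main__":
    print("LD_LIBRARY_PATH =", os.environ.get("LD_LIBRARY_PATH"))
    for path in find_library():
        print(path)
```

Running this once as your normal user and once under sudo shows whether the two invocations see the same LD_LIBRARY_PATH; sudo commonly resets variables of this kind, which would be consistent with the import error above.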
It appears that GP100 is not supported by the tool at this moment.
The answer is found here:
Nsight Compute only supports Pascal (other than GP100) and later GPUs.
I'm running a program to process some data, and I run inference with both a TensorFlow model and a PyTorch model.
Running inference with either model on its own works fine. However, when I add the PyTorch import, my program crashes with this error:
2018-05-14 12:55:05.525251: E tensorflow/stream_executor/cuda/cuda_dnn.cc:385] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2018-05-14 12:55:05.525280: F tensorflow/core/kernels/conv_ops.cc:717] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo<T>(), &algorithms)
Note that this already happens before I do anything with Pytorch. No models are loaded, nothing is put on GPU, no devices are checked.
Does anyone know what might be going wrong, how to fix it, and if there are some parameters I can change?
Something I already tried is disabling the PyTorch backend using this code:
import torch.backends.cudnn as cudnn
cudnn.enabled = False
But unfortunately this does not help...
You'll find some references in the NVIDIA Forums to cuBLAS not playing well with several Python processes interacting with it at the same time. This is referenced in this one-year-old issue for TensorFlow, but it should be the same for any application with multiple clients (PyTorch included) interfacing with the GPU through CUDA - and cuBLAS, to be more specific. cuBLAS handles weren't being properly initialized, apparently due to a mixture of issues related to on-disk caching and RAM utilization being too large.
The solution was both to delete the on-disk cache for cuBLAS,
sudo rm -rf ~/.nv
and restrict the amount of memory usage for nets.
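As a rough sketch of the second step, the TensorFlow 1.x session can be given a cap on GPU memory. The assumptions here are loud ones: this targets the TF 1.x API (tf.GPUOptions, tf.ConfigProto, tf.Session), the helper names per_process_fraction and make_session are made up for illustration, and splitting memory evenly with 10% headroom is an arbitrary choice, not something the answer specifies. The cap only applies to the TensorFlow side, not to PyTorch.

```python
def per_process_fraction(num_processes, headroom=0.1):
    """Split GPU memory evenly between processes, leaving some headroom.

    Both the even split and the default headroom are illustrative choices.
    """
    if num_processes < 1:
        raise ValueError("num_processes must be at least 1")
    if not 0.0 <= headroom < 1.0:
        raise ValueError("headroom must be in [0, 1)")
    return (1.0 - headroom) / num_processes


def make_session(fraction):
    """Create a TF 1.x session limited to `fraction` of each GPU's memory."""
    # Imported here so the helper above works without TensorFlow installed.
    import tensorflow as tf  # assumes TensorFlow 1.x

    gpu_options = tf.GPUOptions(
        per_process_gpu_memory_fraction=fraction,
        # Allocate memory on demand instead of grabbing the cap up front.
        allow_growth=True,
    )
    return tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))


if __name__ == "__main__":
    # Two frameworks sharing one GPU: give TensorFlow its share.
    print(per_process_fraction(2))
```

Usage would be along the lines of sess = make_session(per_process_fraction(2)), created before any TensorFlow op touches the GPU.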
I have built caffe with only cpu support. Is the command 'caffe.set_mode_cpu() ' only used when we have built with gpu support so that we can switch to cpu when needed? I thought I might need it just to make sure that Caffe is using my cpu but I guess the build takes care of that. Also is this command required even when I have built with cpu support only?
The error I get:
WARNING: Logging before InitGoogleLogging() is written to STDERR
E1220 14:26:00.833413 17923 common.cpp:117] Cannot create Cublas handle. Cublas won't be available.
E1220 14:26:00.833684 17923 common.cpp:124] Cannot create Curand generator. Curand won't be available.
E1220 14:26:00.833871 17923 common.cpp:128] Cannot create cuDNN handle. cuDNN won't be available.
F1220 14:26:00.834089 17923 _caffe.cpp:61] Check failed: error == cudaSuccess (35 vs. 0) CUDA driver version is insufficient for CUDA runtime version
*** Check failure stack trace: ***
Aborted (core dumped)
Problem posted on caffe users group
This is my output of 'ccmake ..' . It says that CPU_ONLY is off even after removing the comment on CPU flag. How do I make it build with CPU for sure?
To build Caffe, I used cmake .. instead of make as I got convert_imageset.bin error. So I followed the instructions in the link and I got it to build properly.
Now I was looking at my cmake output and realised that the "CPU_ONLY" option was set to off. So I followed this link, where I used "cmake -DCPU_ONLY=ON" to set it ON.
But I'm still getting the CUDA error even with the cmake option "CPU_ONLY=ON" set. I am not sure why it is still being built with GPU support.
Looking at my cmake output again, I found this error:
CMake Error at CMakeLists.txt:85 (add_dependencies): The dependency target "pycaffe" of target "pytest" does not exist.
Is this fine since anyways we have to do make pycaffe to build with python?
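One way to check which value the build tree actually holds is to read it back from the CMakeCache.txt that cmake writes into the build directory, where entries are stored as NAME:TYPE=VALUE. This is a minimal sketch: the function name read_cmake_cache_option is made up, and the path build/CMakeCache.txt in the usage line is hypothetical and depends on where cmake was run.

```python
def read_cmake_cache_option(cache_path, name):
    """Return the cached value of option `name` from a CMakeCache.txt, or None."""
    with open(cache_path) as handle:
        for line in handle:
            line = line.strip()
            # Skip blank lines and the two comment styles used in the cache.
            if not line or line.startswith(("#", "//")):
                continue
            key, sep, value = line.partition("=")
            if not sep:
                continue
            # Entries look like NAME:TYPE=VALUE; compare only the NAME part.
            if key.split(":", 1)[0] == name:
                return value
    return None


if __name__ == "__main__":
    # Hypothetical location; adjust to your own build directory.
    print(read_cmake_cache_option("build/CMakeCache.txt", "CPU_ONLY"))
```

If this still reports OFF after passing -DCPU_ONLY=ON, the option did not reach the build directory being compiled.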
I tried to retrain (new images, new classes) on top of the pretrained Inception model. I therefore followed the instructions in the Inception README:
https://github.com/tensorflow/models/tree/master/inception#how-to-construct-a-new-dataset-for-retraining
I successfully built and ran build_image_data using bazel, as described in the tutorial. Afterwards I successfully built inception_train using bazel:
~/tensorflowmodels/models/inception# bazel build inception/inception_train
INFO: Found 1 target...
Target //inception:inception_train up-to-date (nothing to build)
INFO: Elapsed time: 0.073s, Critical Path: 0.00s
However, running bazel-bin/inception/inception_train I always get the following:
~/tensorflowmodels/models/inception# bazel-bin/inception/inception_train --train_dir="/" --validation_dir="/" --data_dir="/images_jpg/" --pretrained_model_checkpoint_path="/tensorflowmodels/models/inception/inception-v3/" --fine_tune=True --initial_learning_rate=0.001 --input_queue_memory_factor=1 --num_gpus=1
-bash: bazel-bin/inception/inception_train: No such file or directory
Naturally, I would say there is a 99.9999% chance it's a typo. So then I tried to run inception_train.py with python. I had to change some import locations, and it finally ran with the parameters. However, the script stops without any error messages after the initialization of the CUDA drivers.
Any help on how to solve this (or perform fine tuning / retraining with inception) would be very much appreciated.
tensorflow version: 0.9rc0
CPU: Xeon 5, 24 cores
GPU: Grid K2 8 GB
OS: Ubuntu 14.04
BTW, I already posted this as a GitHub issue (which was closed, since it was considered more a case for Stack Overflow).