Nvidia GeForce 210 compute issue on Ubuntu 18.04 - gpu

I am using ubuntu 18.04 (I have dual booted windows with ubuntu 18.04).
nvidia-smi
This is the output I got when I ran the above command on my ubuntu(18.04) terminal:
Fri Oct 9 09:33:56 2020
+------------------------------------------------------+
| NVIDIA-SMI 340.108 Driver Version: 340.108 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce 210 Off | 0000:01:00.0 N/A | N/A |
| 35% 52C P8 N/A / N/A | 368MiB / 1023MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Compute processes: GPU Memory |
| GPU PID Process name Usage |
|=============================================================================|
| 0 Not Supported |
+-----------------------------------------------------------------------------+
Before that, I followed these steps to install required driver on my system:
sudo add-apt-repository --remove ppa:graphics-drivers/ppa
sudo apt-get purge nvidia*
sudo apt autoremove
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt update
sudo ubuntu-drivers devices
sudo ubuntu-drivers autoinstall
sudo shutdown -r now
When I tried to run Geekbench5 compute benchmark test, the output stopped when it was running Histogram Equalization. This is the output when I ran this ./geekbench5 --compute OpenCL in the folder where I extracted geekbench5:
[1009/092949:FATAL:src/halogen/cuda/cuda_library.cpp(1481)] Failed to load
cuDevicePrimaryCtxRetain: /usr/lib/x86_64-linux-gnu/libcuda.so.1: undefined symbol: cuDevicePrimaryCtxRetain
[1009/092949:FATAL:src/halogen/cuda/cuda_library.cpp(1481)] Failed to load cuDevicePrimaryCtxRetain: /usr/lib/x86_64-linux-gnu/libcuda.so.1: undefined symbol: cuDevicePrimaryCtxRetain
Geekbench 5.2.4 Tryout : https://www.geekbench.com/
Geekbench 5 is in tryout mode.
Geekbench 5 requires an active Internet connection when in tryout mode, and
automatically uploads test results to the Geekbench Browser. Other features
are unavailable in tryout mode.
Buy a Geekbench 5 license to enable offline use and remove the limitations of
tryout mode.
If you would like to purchase Geekbench you can do so online:
https://store.primatelabs.com/v5
If you have already purchased Geekbench, enter your email address and license
key from your email receipt with the following command line:
./geekbench5 -r <email address> <license key>
Running Gathering system information
System Information
Operating System Ubuntu 18.04.5 LTS 4.15.0-118-generic x86_64
Model To be filled by O.E.M. To be filled by O.E.M.
Motherboard O.E.M Intel H81
BIOS American Megatrends Inc. 4.6.5
Processor Information
Name Intel Core i5-4460
Topology 1 Processor, 4 Cores
Identifier GenuineIntel Family 6 Model 60 Stepping 3
Base Frequency 3.20 GHz
L1 Instruction Cache 32.0 KB x 2
L1 Data Cache 32.0 KB x 2
L2 Cache 256 KB x 2
L3 Cache 6.00 MB
Memory Information
Size 7.75 GB
OpenCL Information
Platform Vendor NVIDIA Corporation
Platform Name NVIDIA CUDA
Device Vendor NVIDIA Corporation
Device Name GeForce 210
Device Driver Version 340.108
Maximum Frequency 1.23 GHz
Compute Units 2
Device Memory 1024 MB
OpenCL
Running Sobel
Running Canny
Running Stereo Matching
Running Histogram Equalization
[1009/093329:ERROR:src/interface/console/consolemain.cpp(808)] Geekbench encountered an internal error and cannot continue. Please contact support#primatelabs.com for assistance.
Internal error message: clCreateImage returned -40.
Also, when I tried running the geekbench5 compute benchmark test on windows 10(same machine, on GUI), it paused running at Histogram equalization.
I am not getting any idea why this is happening.Is anything really wrong with my GPU or driver or anything else? I tried to search online, installed the driver again,rebooted the system, but the results are same. Can someone please help?

Your driver installation is fine, but your GPU is 11 years old and does not support some of the more recent features of the OpenCL standard. The geekbench error message -40 means that the image size geekbench uses for one of its benchmarks is not supported by your GPU. This causes the benchmark to crash. Maybe an older version of geekbench still works.

Related

system76 ubuntu 20.04 tensorflow gpu cuda version conflicts

After an upgrade to Ubuntu 20.04 from 18.04 Tensorflow is no longer able to use my gpu because it is attempting to mix and load different versions (some 10 and some 11). It is a System76 machine, and I have cuda 10.1 installed from System76 (so it works with the System76 nvidia driver). When running tensorflow the following errors occur:
2021-01-07 18:12:22.584886: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-01-07 18:12:22.584906: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-01-07 18:12:23.640665: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-01-07 18:12:23.641412: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-01-07 18:12:23.669966: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-07 18:12:23.670257: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 1060 computeCapability: 6.1
coreClock: 1.733GHz coreCount: 10 deviceMemorySize: 5.93GiB deviceMemoryBandwidth: 178.99GiB/s
2021-01-07 18:12:23.670328: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-01-07 18:12:23.670379: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory
2021-01-07 18:12:23.670425: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory
2021-01-07 18:12:23.671387: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-01-07 18:12:23.671667: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-01-07 18:12:23.673022: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-01-07 18:12:23.673100: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory
2021-01-07 18:12:23.673245: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-01-07 18:12:23.673259: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU.
Notice all the warnings are for attempting to load version 11 of Cuda but it's only for some of the libraries. The version 10 ones load fine.
This is the output of nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Fri_Feb__8_19:08:17_PST_2019
Cuda compilation tools, release 10.1, V10.1.105
This is the output of nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.38 Driver Version: 455.38 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX 1060 Off | 00000000:01:00.0 Off | N/A |
| N/A 53C P0 26W / N/A | 585MiB / 6069MiB | 4% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2999 G /usr/lib/xorg/Xorg 101MiB |
| 0 N/A N/A 3479 G /usr/lib/xorg/Xorg 255MiB |
| 0 N/A N/A 3720 G /usr/bin/gnome-shell 88MiB |
| 0 N/A N/A 6487 G ...AAAAAAAA== --shared-files 45MiB |
| 0 N/A N/A 6959 G ...AAAAAAAA== --shared-files 40MiB |
| 0 N/A N/A 11642 G ...AAAAAAAA== --shared-files 21MiB |
| 0 N/A N/A 25206 G WickrMe 17MiB |
+-----------------------------------------------------------------------------+
I see that the driver version in the output of nvidia-smi is version 11, but as I understand it, that has nothing to do with cuda runtime. That is simply the version up to which the driver supports. Correct me if I'm wrong.
I have to use version 10 because that is what is supported by System76 and it worked fine prior to the upgrade. I have also tried uninstalling and re-installing Tensorflow via pip3 and no luck.
Does anyone know how get all the libraries in sync to version 10.1? I also tried to manually place the version 11 libraries in place and let Tensorflow use the mixed version (which of course is a bad idea) but it won't recognize them (or I didn't place them properly).
As #talonmies pointed out, I was misunderstanding the versioning system. However, because it's a System76 machine, it was also confounding because System76 uses their own Nvidia driver, and it's not straightforward to install Cuda 11 and Cudnn. I'm posting the answer in case anyone else runs into problems with System76.
First, DO NOT use the System76 install for Cuda and Cudnn. They have their own versions (on their website) so as to be compatible with their Nvidia driver, but they will not work (they are version 10, and TF 2.2+ requires 11). Also, most general Cuda guides will tell you to uninstall/install the Nvida driver first so as to have a clean install, but DO NOT do this if you have a System76 system. Just leave the System76 driver alone. Also, if you have any previous Cuda/Cudnn remove/uninstall all of it.
Go to Nvidia and get their latest Cuda and Cudnn. I used
wget http://developer.download.nvidia.com/compute/cuda/11.0.2/local_installers/cuda_11.0.2_450.51.05_linux.run
Run that with
sudo sh cuda_11.0.2_450.51.05_linux.run
When it runs it will tell you that you have a conflict with the driver package. Ignore that and proceed. When you get to the install menu, UNCHECK "install driver" and continue with the install. When it's done, add to your path
/usr/local/cuda-11.0:/usr/local/cuda-11.0/bin:
You need to add both the cuda root and bin, not just bin (which is different than most general instructions). Source your .bashrc or .profile or wherever you put the path addition (or open a new terminal).
Now install Cudnn.
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/libcudnn8_8.0.5.39-1+cuda11.0_amd64.deb
Install it with dpkg. For example (in my case)...
sudo dpkg -i libcudnn8_8.0.5.39-1+cuda11.0_amd64.deb
That's it. Once I completed all that, everything worked fine. Hope that helps some System76 people get through Ununtu 20.04 and Cuda 11 a little easier.
Thank you very much.
One of the reasons I have used POP OS is that the Nvidia drivers+cuda/cudnn just worked with tensorflow, until this issue with version 11.0 missing.
One thing I needed to be able in install cuda 11.0 using the recipe above was to install gcc versions 8 :
sudo apt -y install gcc-8 g++-8
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-8 8
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-8 8
I really wish POP!_os would provide CUDA 11.0 packages directly.....

What is the correct version of CUDNN for CUDA 11.0

I want to start using tensorflow-gpu, and I looked some stuff up, and found out that I need to ensure that I have both CUDA and CUDNN. So, I opened up the command prompt and ran the command nvidia-smi to check my CUDA version:
C:\Program Files\NVIDIA Corporation\NVSMI>nvidia-smi
Tue Jun 02 14:13:03 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 445.87 Driver Version: 445.87 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1050 WDDM | 00000000:01:00.0 Off | N/A |
| N/A 40C P8 N/A / N/A | 77MiB / 4096MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU PID Type Process name GPU Memory |
| Usage |
|=============================================================================|
| 0 10488 C+G ...n64\EpicGamesLauncher.exe N/A |
| 0 12636 C+G ...4\UnrealCEFSubProcess.exe N/A |
+-----------------------------------------------------------------------------+
Now that I see my CUDA version is 11.0, I went to the NVidia's website to select a version of CUDNN that can work with CUDA 11.0, but the latest ones support up to CUDA 10.2 currently. What should I do? Can I use the one for CUDA 10.2?
What nvidia-smi shows is not the CUDA version that you have installed, but the maximum CUDA version that your driver supports.
CUDA 11.0 has been announced but not released yet (as of June 2nd 2020), so you should use CUDA 10.2 as it's the latest available version.
A couple of weeks ago, I have upgraded three of them to the new cuda_11.0.2, Driver 450.51.06 and cuDNN _8.0.
My environment:
86-64
Centos 7 with gcc 4.8.5 ( sudo doesn't work in Centos. Login as root)
I downloaded cuda_11.0.2-450.51.05_linux.run
I took a risk but it went fine. On Nvidia cudnn matrix it said:
Compute > 3.5, toolkit =11.0 , and driver r450
So the driver and toolkit minors doesn't matter.
Installed, and went through pre-, post- and recommended.
Everything went fine.
*This is very important
My cudnn installed but couldn't run the examples.
If you are an Engineer, you have went through such dilemma because you bypass some small details.
Gcc 4.8.5 if for installing toolkit and driver.
Cudnn 8.0 needs gcc 5 and above for c++ 11 or 14 for tool chain.
So what I have done is that( I have a lot of. devtoolset versions in my environment).
I choose 6.0 version instead of 5 to make not be on the border line.
Re-install it, you will be cool.
***Regarding tensor-flow×××: It has nothing to do with cudnn other than kera for python if I get this right.

Tensorflow complains that no CUDA-capable device is detected

I'm trying to run some Tensorflow code, and I get what seems to be a common problem:
$ LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64 python -c "import tensorflow; tensorflow.Session()"
2019-02-06 20:36:15.903204: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-02-06 20:36:15.908809: E tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2019-02-06 20:36:15.908858: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:163] retrieving CUDA diagnostic information for host: tigris
2019-02-06 20:36:15.908868: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:170] hostname: tigris
2019-02-06 20:36:15.908942: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:194] libcuda reported version is: 390.77.0
2019-02-06 20:36:15.908985: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:198] kernel reported version is: 390.30.0
2019-02-06 20:36:15.909006: E tensorflow/stream_executor/cuda/cuda_diagnostics.cc:308] kernel version 390.30.0 does not match DSO version 390.77.0 -- cannot find working devices in this configuration
$
The key pieces of that error message seem to be:
[...] libcuda reported version is: 390.77.0
[...] kernel reported version is: 390.30.0
[...] kernel version 390.30.0 does not match DSO version 390.77.0 -- cannot find working devices in this configuration
How can I install compatible versions? Where is that libcuda version coming from?
Background
A few months ago, I tried installing Tensorflow with GPU support, but the versions either broke my display or wouldn't work with Tensorflow. Finally, I got it working by following a tutorial on how to install multiple versions of the CUDA libraries on the same machine. That worked at the time, but when I came back to the project after a few months, it has stopped working. I assume that some driver got upgraded during that time.
Investigation
The first thing I tried was to see what versions I have of the nvidia drivers and libcuda package.
$ dpkg --list|grep libcuda
ii libcuda1-390 390.30-0ubuntu1 amd64 NVIDIA CUDA runtime library
Looks like it's 390.30. Why does the error message say that libcuda reported 390.77?
$ dpkg --list|grep nvidia
ii libnvidia-container-tools 1.0.1-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.0.1-1 amd64 NVIDIA container runtime library
rc nvidia-384 384.130-0ubuntu0.16.04.1 amd64 NVIDIA binary driver - version 384.130
ii nvidia-390 390.30-0ubuntu1 amd64 NVIDIA binary driver - version 390.30
ii nvidia-390-dev 390.30-0ubuntu1 amd64 NVIDIA binary Xorg driver development files
rc nvidia-396 396.44-0ubuntu1 amd64 NVIDIA binary driver - version 396.44
ii nvidia-container-runtime 2.0.0+docker18.09.1-1 amd64 NVIDIA container runtime
ii nvidia-container-runtime-hook 1.4.0-1 amd64 NVIDIA container runtime hook
ii nvidia-docker2 2.0.3+docker18.09.1-1 all nvidia-docker CLI wrapper
ii nvidia-modprobe 390.30-0ubuntu1 amd64 Load the NVIDIA kernel driver and create device files
rc nvidia-opencl-icd-384 384.130-0ubuntu0.16.04.1 amd64 NVIDIA OpenCL ICD
ii nvidia-opencl-icd-390 390.30-0ubuntu1 amd64 NVIDIA OpenCL ICD
rc nvidia-opencl-icd-396 396.44-0ubuntu1 amd64 NVIDIA OpenCL ICD
ii nvidia-prime 0.8.8.2 all Tools to enable NVIDIA's Prime
ii nvidia-settings 396.44-0ubuntu1 amd64 Tool for configuring the NVIDIA graphics driver
Again, everything looks like it's 390.30. There were some packages that had version 390.77, but they were in the rc status. I guess I installed that version and later removed it, so the configuration files were left behind. I purged the configuration files with commands like this:
sudo apt-get remove --purge nvidia-kernel-common-390
Now, there are no packages at all with version 390.77.
$ dpkg --list|grep 390.77
$
I tried reinstalling CUDA, to see if it had been compiled with the wrong version.
$ sudo sh cuda_9.0.176_384.81_linux.run --silent --toolkit --toolkitpath=/usr/local/cuda-9.0 --override
That didn't make any difference.
Finally, I tried running nvidia-smi.
$ LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64 nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
$
All of this is running on Ubuntu 18.04 with Python 3.6.7, and my graphics card is NVIDIA Corporation GM107M [GeForce GTX 960M] (rev a2).
I finally had the idea to look for any files with 390.77 in the name.
$ locate 390.77
/usr/lib/i386-linux-gnu/libcuda.so.390.77
/usr/lib/i386-linux-gnu/libnvcuvid.so.390.77
/usr/lib/i386-linux-gnu/libnvidia-compiler.so.390.77
/usr/lib/i386-linux-gnu/libnvidia-encode.so.390.77
/usr/lib/i386-linux-gnu/libnvidia-fatbinaryloader.so.390.77
/usr/lib/i386-linux-gnu/libnvidia-ml.so.390.77
/usr/lib/i386-linux-gnu/libnvidia-opencl.so.390.77
/usr/lib/i386-linux-gnu/libnvidia-ptxjitcompiler.so.390.77
/usr/lib/i386-linux-gnu/vdpau/libvdpau_nvidia.so.390.77
/usr/lib/x86_64-linux-gnu/libcuda.so.390.77
/usr/lib/x86_64-linux-gnu/libnvcuvid.so.390.77
/usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.390.77
/usr/lib/x86_64-linux-gnu/libnvidia-encode.so.390.77
/usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.390.77
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.390.77
/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.390.77
/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.390.77
/usr/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so.390.77
So there they are! A closer look shows that I must have installed the newer version at some point.
$ ls /usr/lib/i386-linux-gnu/libcuda* -l
lrwxrwxrwx 1 root root 12 Nov 8 13:58 /usr/lib/i386-linux-gnu/libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root 17 Nov 12 14:04 /usr/lib/i386-linux-gnu/libcuda.so.1 -> libcuda.so.390.77
-rw-r--r-- 1 root root 9179124 Jan 31 2018 /usr/lib/i386-linux-gnu/libcuda.so.390.30
-rw-r--r-- 1 root root 9179796 Jul 10 2018 /usr/lib/i386-linux-gnu/libcuda.so.390.77
Where did they come from?
$ dpkg -S /usr/lib/i386-linux-gnu/libcuda.so.390.30
libcuda1-390: /usr/lib/i386-linux-gnu/libcuda.so.390.30
$ dpkg -S /usr/lib/i386-linux-gnu/libcuda.so.390.77
dpkg-query: no path found matching pattern /usr/lib/i386-linux-gnu/libcuda.so.390.77
So the 390.77 no longer belongs to any package. Perhaps I installed the old version and had to force it to overwrite the links.
My plan is to delete the files, then reinstall the packages to set up the links to the correct version. So which packages will I need to reinstall?
$ locate 390.77|sed -e 's/390.77/390.30/'|xargs dpkg -S
Some of the files don't match anything, but the ones that do match are from these packages:
libcuda1-390
nvidia-opencl-icd-390
Crossing my fingers, I delete the version 390.77 files.
locate 390.77|sudo xargs rm
Then I reinstall the packages.
sudo apt-get install --reinstall libcuda1-390 nvidia-opencl-icd-390
Finally, it works!
$ LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64 python -c "import tensorflow; tensorflow.Session()"
2019-02-06 22:13:59.460822: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-02-06 22:13:59.665756: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-02-06 22:13:59.666205: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce GTX 960M major: 5 minor: 0 memoryClockRate(GHz): 1.176
pciBusID: 0000:01:00.0
totalMemory: 3.95GiB freeMemory: 3.81GiB
2019-02-06 22:13:59.666226: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-06 22:17:21.254445: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-06 22:17:21.254489: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-06 22:17:21.254496: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-06 22:17:21.290992: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3539 MB memory) -> physical GPU (device: 0, name: GeForce GTX 960M, pci bus id: 0000:01:00.0, compute capability: 5.0)
nvidia-smi also works now.
$ LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64 nvidia-smi
Wed Feb 6 22:19:24 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30 Driver Version: 390.30 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 960M Off | 00000000:01:00.0 Off | N/A |
| N/A 45C P8 N/A / N/A | 113MiB / 4046MiB | 6% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 3212 G /usr/lib/xorg/Xorg 113MiB |
+-----------------------------------------------------------------------------+
I rebooted, and the video drivers continued to work. Hurrah!

What is the reason that TensorFlow does not detect GPU on Windows

I have installed CUDA 9.0 on my machine which has the NVIDIA GTX 1080 graphics cards. When I run the command nvcc --version then I get:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:32_Central_Daylight_Time_2017
Cuda compilation tools, release 9.0, V9.0.176
But I have tried the steps from the TensorFlow official site to install TF with GPU support, but it still using the CPU.
I have tried pip install and Anaconda install, all was the same result. No one was able to detect GPU, then I have tried many other tutorials on the web, which they were able to detect the GPU, but I am not.
What can be the reason, is there any changing in the new GPU version of TF? If yes, then what is the latest documentation to install TF with GPU support, if not, then where I am doing wrong.
Thanks!
Update1: Tensorflow really wastes my time. Very annoying, at the first I decided to build TF from source, to use it with CUDA 10, but on both OS Windows 10 and Ubuntu 18.04 I was unable to build it successfully. So I gave up, then I decided to use with CUDA 9.0, which is not supported in Ubuntu 18.04, so I came back to windows, but even still the prebuilt library of TF not working, really annoying.
I don't know why TF still using CUDA 9.0 which CUDA 10.0 already officially released, and TF still not supporting Python 3.7? amazing not? and the same thing with MS Build Tools 2015, which 2017 already exist, and many more tools. TF relays on old versions of the tools which make a lot of problem for some people that they must uninstall their new versions which still using, it is very annoying...
Update2: nvidia-smi output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 417.71 Driver Version: 417.71 CUDA Version: 9.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1080 WDDM | 00000000:01:00.0 On | N/A |
| 27% 35C P8 8W / 180W | 498MiB / 8192MiB | 1% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1264 C+G Insufficient Permissions N/A |
| 0 2148 C+G ...0108.0_x64__8wekyb3d8bbwe\HxOutlook.exe N/A |
| 0 4360 C+G ...mmersiveControlPanel\SystemSettings.exe N/A |
| 0 7332 C+G C:\Windows\explorer.exe N/A |
| 0 7384 C+G ...t_cw5n1h2txyewy\ShellExperienceHost.exe N/A |
| 0 8488 C+G ...dows.Cortana_cw5n1h2txyewy\SearchUI.exe N/A |
| 0 9704 C+G ...osoft.LockApp_cw5n1h2txyewy\LockApp.exe N/A |
| 0 10588 C+G ...al\Google\Chrome\Application\chrome.exe N/A |
| 0 10904 C+G ...x64__8wekyb3d8bbwe\Microsoft.Photos.exe N/A |
| 0 12608 C+G ...DIA GeForce Experience\NVIDIA Share.exe N/A |
| 0 13000 C+G ...241.0_x64__8wekyb3d8bbwe\Calculator.exe N/A |
| 0 14668 C+G ...ng4wbp0\app\DellMobileConnectClient.exe N/A |
| 0 17628 C+G ...2.0_x64__8wekyb3d8bbwe\WinStore.App.exe N/A |
| 0 18060 C+G ...oftEdge_8wekyb3d8bbwe\MicrosoftEdge.exe N/A |
+-----------------------------------------------------------------------------+
I finally figured out the problem. this may help others It is a bug with TF 1.12, I have removed and reinstalled TF 1.11 which it is able to detect GPU.
Some suggestions to TF team:
Before releasing a new version, please make sure that it is working
in all OS systems without any bugs (the bugs like this which I
against is really a very low lever bug)
Please also refresh your third-party libraries to support the newest
versions, e.g: CUDA 10, which I had installed in my machine, but because of installing TF I stepped back to 9.0; annoying. VS 2015,
Python 3.7, and and and ... as well.
Please update the documentation, for both install and building from
source, describe everything clearly, what needs, what to install, how to build all of the need tools and utils must be described clearly. In the documentation, I found that building TF from source is
very very easy, but in reality was not, there I found a lot of errors
like the others, so I was unable to build from source.
Till now I found the TF the most annoying framework, building and installing. TF is very sensitive the happening of errors probability in both building or installing is very high.
Good Luck!!

CUDA_ERROR_OUT_OF_MEMORY in tensorflow

When I started to train some neural network, it met the CUDA_ERROR_OUT_OF_MEMORY but the training could go on without error. Because I wanted to use gpu memory as it really needs, so I set the gpu_options.allow_growth = True.The logs are as follows:
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so locally
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:951] Found device 0 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:01:00.0
Total memory: 7.92GiB
Free memory: 7.81GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:972] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Creating TensorFlow device (/gpu:0) -> (device:0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0)
E tensorflow/stream_executor/cuda/cuda_driver.cc:965] failed to allocate 4.00G (4294967296 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
Iter 20, Minibatch Loss= 40491.636719
...
And after using nvidia-smi command, it gets:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.27 Driver Version: 367.27
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M.
|===============================+======================+======================|
| 0 GeForce GTX 1080 Off | 0000:01:00.0 Off | N/A |
| 40% 61C P2 46W / 180W | 8107MiB / 8111MiB | 96% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 1080 Off | 0000:02:00.0 Off | N/A |
| 0% 40C P0 40W / 180W | 0MiB / 8113MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
│
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 22932 C python 8105MiB |
+-----------------------------------------------------------------------------+
After I commented the gpu_options.allow_growth = True, I trained the net again and everything was normal. There was no the problem of CUDA_ERROR_OUT_OF_MEMORY. Finally, ran the nvidia-smi command, it gets:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.27 Driver Version: 367.27
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M.
|===============================+======================+======================|
| 0 GeForce GTX 1080 Off | 0000:01:00.0 Off | N/A |
| 40% 61C P2 46W / 180W | 7793MiB / 8111MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 1080 Off | 0000:02:00.0 Off | N/A |
| 0% 40C P0 40W / 180W | 0MiB / 8113MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
│
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 22932 C python 7791MiB |
+-----------------------------------------------------------------------------+
I have two questions about it. Why did the CUDA_OUT_OF_MEMORY come out and the procedure went on normally? why did the memory usage become smaller after commenting allow_growth = True.
In case it's still relevant for someone, I encountered this issue when trying to run Keras/Tensorflow for the second time, after a first run was aborted. It seems the GPU memory is still allocated, and therefore cannot be allocated again. It was solved by manually ending all python processes that use the GPU, or alternatively, closing the existing terminal and running again in a new terminal window.
By default, tensorflow try to allocate a fraction per_process_gpu_memory_fraction of the GPU memory to his process to avoid costly memory management. (See the GPUOptions comments).
This can fail and raise the CUDA_OUT_OF_MEMORY warnings.
I do not know what is the fallback in this case (either using CPU ops or a allow_growth=True).
This can happen if an other process uses the GPU at the moment (If you launch two process running tensorflow for instance).
The default behavior takes ~95% of the memory (see this answer).
When you use allow_growth = True, the GPU memory is not preallocated and will be able to grow as you need it. This will lead to smaller memory usage (as the default option is to use the whole memory) but decreases the perfomances if not use properly as it requires a more complex handeling of the memory (which is not the most efficient part of CPU/GPU interactions).
I faced this issue when trying to train model back to back. I figured that the GPU memory wasn't available due to previous training run. So I found the easiest way would be to manually flush the GPU memory before every next training.
Use nvidia-smi to check the GPU memory usage:
nvidia-smi
nvidia-smi --gpu-reset
The above command may not work if other processes are actively using the GPU.
Alternatively you can use the following command to list all the processes that are using GPU:
sudo fuser -v /dev/nvidia*
And the output should look like this:
USER PID ACCESS COMMAND
/dev/nvidia0: root 2216 F...m Xorg
sid 6114 F...m krunner
sid 6116 F...m plasmashell
sid 7227 F...m akonadi_archive
sid 7239 F...m akonadi_mailfil
sid 7249 F...m akonadi_sendlat
sid 18120 F...m chrome
sid 18163 F...m chrome
sid 24154 F...m code
/dev/nvidiactl: root 2216 F...m Xorg
sid 6114 F...m krunner
sid 6116 F...m plasmashell
sid 7227 F...m akonadi_archive
sid 7239 F...m akonadi_mailfil
sid 7249 F...m akonadi_sendlat
sid 18120 F...m chrome
sid 18163 F...m chrome
sid 24154 F...m code
/dev/nvidia-modeset: root 2216 F.... Xorg
sid 6114 F.... krunner
sid 6116 F.... plasmashell
sid 7227 F.... akonadi_archive
sid 7239 F.... akonadi_mailfil
sid 7249 F.... akonadi_sendlat
sid 18120 F.... chrome
sid 18163 F.... chrome
sid 24154 F.... code
From here, I got the PID for the process which was holding the GPU memory, which in my case is 24154.
Use the following command to kill the process by its PID:
sudo kill -9 MY_PID
Replace MY_PID with the relevant PID.
Tensorflow 2.0 alpha
The problem is, that Tensorflow is greedy in allocating all available VRAM. That causes issues for some people.
For Tensorflow 2.0 alpha / nightly use this:
import tensorflow as tf
tf.config.gpu.set_per_process_memory_fraction(0.4)
Source: https://www.tensorflow.org/alpha/guide/using_gpu
I was experienced memory error in Ubuntu 18.10.
When i changed resolution of my monitor from 4k to fullhd (1920-1080) memory available become 438mb and neural network training started.
I was really surprised by this behavior.
By the way, i have Nvidia 1080 with 8gb memory, still dont know why only 400mb available
Environment:
1.CUDA 10.0
2.cuNDD 10.0
3.tensorflow 1.14.0
4.pip install opencv-contrib-python
5.git clone https://github.com/thtrieu/darkflow
6.Allowing GPU memory growth
Reference
fuser -k /dev/nvidia[0]
Worked for me.
Thanks to https://forums.developer.nvidia.com/t/11-gb-of-gpu-ram-used-and-no-process-listed-by-nvidia-smi/44459/16
Check the correctness of the input dataset.
İf you have a null input list may occur this error too.
The situation that I faced in Colab with tf.keras