Segmentation fault (core dumped) on tf.Session() - tensorflow

I am new to TensorFlow.
I just installed TensorFlow and, to test the installation, I ran the following code. As soon as I create the TF Session, I get the Segmentation fault (core dumped) error.
bafhf@remote-server:~$ python
Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
/home/bafhf/anaconda3/envs/ismll/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
>>> tf.Session()
2018-05-15 12:04:15.461361: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1349] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:04:00.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
Segmentation fault (core dumped)
My nvidia-smi is:
Tue May 15 12:12:26 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30 Driver Version: 390.30 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 On | 00000000:04:00.0 Off | 0 |
| N/A 38C P8 26W / 149W | 0MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 On | 00000000:05:00.0 Off | 2 |
| N/A 31C P8 29W / 149W | 0MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
And nvcc --version is:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
Also gcc --version is:
gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Following is my PATH:
/home/bafhf/bin:/home/bafhf/.local/bin:/usr/local/cuda/bin:/usr/local/cuda/lib:/usr/local/cuda/extras/CUPTI/lib:/home/bafhf/anaconda3/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
and the LD_LIBRARY_PATH:
/usr/local/cuda/bin:/usr/local/cuda/lib:/usr/local/cuda/extras/CUPTI/lib
I am running this on a server and I don't have root privileges. Still, I managed to install everything per the instructions on the official website.
Edit: New observations:
It seems like the GPU allocates memory for the process for a second, and then the Segmentation fault (core dumped) error is thrown:
Edit 2: Changed TensorFlow version
I downgraded my TensorFlow version from v1.8 to v1.5. The issue still remains.
Is there any way to address or debug this issue?

This could possibly occur because you are using multiple GPUs here. Try setting CUDA_VISIBLE_DEVICES to just one of the GPUs. See this link for instructions on how to do that. In my case, this solved the problem.
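For example, to expose only the first GPU to the process (a minimal sketch; the variable has to be set before TensorFlow initializes CUDA):
export CUDA_VISIBLE_DEVICES=0   # expose only GPU 0 to this shell and everything launched from it
python -c "import tensorflow as tf; tf.Session()"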

If you look at the nvidia-smi output, the second GPU has an ECC error count of 2. This error manifests itself irrespective of the CUDA or TF version, usually as a segfault, and sometimes with the CUDA_ERROR_ECC_UNCORRECTABLE flag in the stack trace.
I came to this conclusion from this post:
"Uncorrectable ECC error" usually refers to a hardware failure. ECC is
Error Correcting Code, a means to detect and correct errors in bits
stored in RAM. A stray cosmic ray can disrupt one bit stored in RAM
every once in a great while, but "uncorrectable ECC error" indicates
that several bits are coming out of RAM storage "wrong" - too many for
the ECC to recover the original bit values.
This could mean that you have a bad or marginal RAM cell in your GPU
device memory.
Marginal circuits of any kind may not fail 100%, but are more likely
to fail under the stress of heavy use - and associated rise in
temperature.
A reboot is usually supposed to clear the ECC error. If it does not, the only remaining option seems to be replacing the hardware.
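Before going for a reboot, you can inspect the per-GPU ECC counters directly (a small sketch; querying does not require root):
nvidia-smi -i 1 -q -d ECC   # detailed ECC mode and error counts for GPU 1, the card showing "2" above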
So what did I do, and how did I finally fix the issue?
I tested my code on a separate machine with an NVIDIA 1050 Ti, and it executed perfectly fine.
I ran the code only on the first card, for which the ECC value was normal, just to narrow down the issue. I did this by setting the CUDA_VISIBLE_DEVICES environment variable, following this post.
I then requested a restart of the Tesla K80 server to check whether a reboot would fix the issue. It took a while, but the server was eventually restarted.
Now the issue is gone and I can run both cards for my TensorFlow implementations.

In case anyone is still interested: I happened to have the same issue, with the "Volatile Uncorr. ECC" output. My problem was incompatible versions, as shown below:
Loaded runtime CuDNN library: 7.1.1 but source was compiled with:
7.2.1. CuDNN library major and minor version needs to match or have higher minor version in case of CuDNN 7.0 or later version. If using a
binary install, upgrade your CuDNN library. If building from sources,
make sure the library loaded at runtime is compatible with the version
specified during compile configuration. Segmentation fault
After I upgraded the CuDNN library to 7.3.1 (which is greater than 7.2.1), the segmentation fault error disappeared. To upgrade, I did the following (as also documented here).
Download the CuDNN library from the NVIDIA website
sudo tar -xzvf [TAR_FILE]
sudo cp cuda/include/cudnn.h /usr/local/cuda/include
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
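To confirm that the new library is the one being picked up, one quick check (a sketch assuming cuDNN 7.x, where the version macros live in cudnn.h; newer releases moved them to cudnn_version.h):
cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2   # prints CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL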

I was also facing the same issue. I have a workaround you can try.
I followed these steps:
1. Reinstall Python 3.5 or above.
2. Reinstall CUDA and add the cuDNN libraries to it.
3. Reinstall the TensorFlow 1.8.0 GPU version.

Check that you are using the exact versions of CUDA and cuDNN required by TensorFlow, and also that you are using the graphics card driver version that ships with that CUDA version.
I once had a similar issue with a driver that was too recent. Downgrading it to the version shipped with the CUDA release required by TensorFlow solved the issue for me.
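A quick way to line the versions up (a small sketch; compare the output against the tested build configurations listed in the TensorFlow install docs for your TF release):
nvidia-smi                # driver version, top line of the table
nvcc --version            # CUDA toolkit version
python -c "import tensorflow as tf; print(tf.__version__)"   # TensorFlow version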

I encountered this problem recently.
The reason was multiple GPUs in a Docker container.
The solution is pretty simple; you either:
set CUDA_VISIBLE_DEVICES on the host (see https://stackoverflow.com/a/50464695/2091555, and the sketch after the docker command below)
or
use --ipc=host to launch the Docker container if you need multiple GPUs, e.g.:
docker run --runtime nvidia --ipc host \
    --rm -it \
    nvidia/cuda:10.0-cudnn7-runtime-ubuntu16.04
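Alternatively, a sketch of the first option, restricting the container to a single GPU at launch time (passing CUDA_VISIBLE_DEVICES as an environment variable is an assumption about your setup; the linked answer covers setting it on the host instead):
docker run --runtime nvidia -e CUDA_VISIBLE_DEVICES=0 --rm -it \
    nvidia/cuda:10.0-cudnn7-runtime-ubuntu16.04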
This problem is actually pretty nasty: the segfault happens during cuInit() calls inside the Docker container, while everything works fine on the host. I will leave the log here to make this answer easier for other people to find via search engines.
(base) root@e121c445c1eb:~# conda install pytorch torchvision cudatoolkit=10.0 -c pytorch
Collecting package metadata (current_repodata.json): / Segmentation fault (core dumped)
(base) root@e121c445c1eb:~# gdb python /data/corefiles/core.conda.572.1569384636
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...done.
warning: core file may not match specified executable file.
[New LWP 572]
[New LWP 576]
warning: Unexpected size of section `.reg-xstate/572' in core file.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/opt/conda/bin/python /opt/conda/bin/conda upgrade conda'.
Program terminated with signal SIGSEGV, Segmentation fault.
warning: Unexpected size of section `.reg-xstate/572' in core file.
#0 0x00007f829f0a55fb in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so
[Current thread is 1 (Thread 0x7f82bbfd7700 (LWP 572))]
(gdb) bt
#0 0x00007f829f0a55fb in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so
#1 0x00007f829f06e3a5 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so
#2 0x00007f829f07002c in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so
#3 0x00007f829f0e04f7 in cuInit () from /usr/lib/x86_64-linux-gnu/libcuda.so
#4 0x00007f82b99a1ec0 in ffi_call_unix64 () from /opt/conda/lib/python3.7/lib-dynload/../../libffi.so.6
#5 0x00007f82b99a187d in ffi_call () from /opt/conda/lib/python3.7/lib-dynload/../../libffi.so.6
#6 0x00007f82b9bb7f7e in _call_function_pointer (argcount=1, resmem=0x7ffded858980, restype=<optimized out>, atypes=0x7ffded858940, avalues=0x7ffded858960, pProc=0x7f829f0e0380 <cuInit>,
flags=4353) at /usr/local/src/conda/python-3.7.3/Modules/_ctypes/callproc.c:827
#7 _ctypes_callproc () at /usr/local/src/conda/python-3.7.3/Modules/_ctypes/callproc.c:1184
#8 0x00007f82b9bb89b4 in PyCFuncPtr_call () at /usr/local/src/conda/python-3.7.3/Modules/_ctypes/_ctypes.c:3969
#9 0x000055c05db9bd2b in _PyObject_FastCallKeywords () at /tmp/build/80754af9/python_1553721932202/work/Objects/call.c:199
#10 0x000055c05dbf7026 in call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>) at /tmp/build/80754af9/python_1553721932202/work/Python/ceval.c:4619
#11 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1553721932202/work/Python/ceval.c:3124
#12 0x000055c05db9a79b in function_code_fastcall (globals=<optimized out>, nargs=0, args=<optimized out>, co=<optimized out>)
at /tmp/build/80754af9/python_1553721932202/work/Objects/call.c:283
#13 _PyFunction_FastCallKeywords () at /tmp/build/80754af9/python_1553721932202/work/Objects/call.c:408
#14 0x000055c05dbf2846 in call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>) at /tmp/build/80754af9/python_1553721932202/work/Python/ceval.c:4616
#15 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1553721932202/work/Python/ceval.c:3124
... (stack omitted)
#46 0x000055c05db9aa27 in _PyFunction_FastCallKeywords () at /tmp/build/80754af9/python_1553721932202/work/Objects/call.c:433
---Type <return> to continue, or q <return> to quit---q
Quit
Another attempt was to install with pip:
(base) root@e121c445c1eb:~# pip install torch torchvision
(base) root@e121c445c1eb:~# python
Python 3.7.3 (default, Mar 27 2019, 22:11:17)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
Segmentation fault (core dumped)
(base) root@e121c445c1eb:~# gdb python /data/corefiles/core.python.28.1569385311
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...done.
warning: core file may not match specified executable file.
[New LWP 28]
warning: Unexpected size of section `.reg-xstate/28' in core file.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
bt
Core was generated by `python'.
Program terminated with signal SIGSEGV, Segmentation fault.
warning: Unexpected size of section `.reg-xstate/28' in core file.
#0 0x00007ffaa1d995fb in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
(gdb) bt
#0 0x00007ffaa1d995fb in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#1 0x00007ffaa1d623a5 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2 0x00007ffaa1d6402c in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3 0x00007ffaa1dd44f7 in cuInit () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4 0x00007ffaee75f724 in cudart::globalState::loadDriverInternal() () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#5 0x00007ffaee760643 in cudart::__loadDriverInternalUtil() () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#6 0x00007ffafe2cda99 in __pthread_once_slow (once_control=0x7ffaeebe2cb0 <cudart::globalState::loadDriver()::loadDriverControl>,
... (stack omitted)

I am using TensorFlow in a cloud environment from Paperspace.
Updating to cuDNN 7.3.1 did not work for me.
One way is to build TensorFlow with proper GPU and CPU support.
This is not a proper solution, but it solved my issue temporarily (downgrading TensorFlow to 1.5.0):
pip uninstall tensorflow-gpu
pip install tensorflow==1.5.0
pip install numpy==1.14.0
pip install six==1.10.0
pip install joblib==0.12
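A quick check that the downgraded build imports and creates a session cleanly (a minimal sketch; the plain tensorflow 1.5.0 package installed above is the CPU-only wheel, so no GPU is needed for this test):
python -c "import tensorflow as tf; print(tf.__version__); tf.Session()"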
Hope this helps!

Related

PackagesNotFoundError When Trying to Install intel_extension_for_pytorch

I am trying to conda install intel_extension_for_pytorch but I keep getting the following error in the command line:
PackagesNotFoundError: The following packages are not available from current channels:
intel_extension_for_pytorch
This is the command that I am using:
conda install intel_extension_for_pytorch
Edit:
System Info:
Microsoft Windows [Version 10.0.19044.2006]
Processor 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz, 2995 Mhz, 4 Core(s), 8 Logical Processor(s)
Currently, the Intel Extension for PyTorch is only supported on Linux. Try it on a recent Linux version; it should work there.
Check out the docs for more info: https://www.intel.com/content/www/us/en/developer/tools/oneapi/extension-for-pytorch.html
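On Linux, the install typically goes through pip rather than conda (a hedged sketch based on the Intel docs linked above; the exact wheel names and the required torch version vary by release):
python -m pip install torch
python -m pip install intel_extension_for_pytorch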

ValueError: invalid literal for int() with base 10: '' while building tensorflow from source with gpu support [duplicate]

When I install tensorflow-gpu through conda, it gives me the following output:
conda install tensorflow-gpu
Collecting package metadata (current_repodata.json): done
Solving environment: done
## Package Plan ##
environment location: /home/psychotechnopath/anaconda3/envs/DeepLearning3.6
added / updated specs:
- tensorflow-gpu
The following packages will be downloaded:
package | build
---------------------------|-----------------
_tflow_select-2.1.0 | gpu 2 KB
cudatoolkit-10.1.243 | h6bb024c_0 347.4 MB
cudnn-7.6.5 | cuda10.1_0 179.9 MB
cupti-10.1.168 | 0 1.4 MB
tensorflow-2.1.0 |gpu_py36h2e5cdaa_0 4 KB
tensorflow-base-2.1.0 |gpu_py36h6c5654b_0 155.9 MB
tensorflow-gpu-2.1.0 | h0d30ee6_0 3 KB
------------------------------------------------------------
Total: 684.7 MB
The following NEW packages will be INSTALLED:
cudatoolkit pkgs/main/linux-64::cudatoolkit-10.1.243-h6bb024c_0
cudnn pkgs/main/linux-64::cudnn-7.6.5-cuda10.1_0
cupti pkgs/main/linux-64::cupti-10.1.168-0
tensorflow-gpu pkgs/main/linux-64::tensorflow-gpu-2.1.0-h0d30ee6_0
I see that installing tensorflow-gpu automatically triggers the installation of cudatoolkit and cudnn. Does this mean that I no longer need to install CUDA and cuDNN manually to be able to use tensorflow-gpu? Where does this conda installation of CUDA reside?
I first installed CUDA and cuDNN the old way (e.g. by following these installation instructions: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html )
And then I noticed that tensorflow-gpu was also installing cuda and cudnn.
Do I now have two versions of CUDA/cuDNN installed, and how do I check this?
Do I now have two versions of CUDA installed, and how do I check this?
No.
conda installs the bare minimum redistributable library components required to support the CUDA accelerated packages they offer. The package name cudatoolkit is a complete misnomer. It is nothing of the sort. Even though it is now greatly expanded in scope from what it used to be (literally 5 files -- I think at some point they must have gotten a licensing deal from NVIDIA because some of this wasn't/isn't on the official "freely redistributable" list AFAIK), it still is basically just a handful of libraries.
You can check this for yourself:
cat /opt/miniconda3/conda-meta/cudatoolkit-10.1.168-0.json
{
"build": "0",
"build_number": 0,
"channel": "https://repo.anaconda.com/pkgs/main/linux-64",
"constrains": [],
"depends": [],
"extracted_package_dir": "/opt/miniconda3/pkgs/cudatoolkit-10.1.168-0",
"features": "",
"files": [
"lib/cudatoolkit_config.yaml",
"lib/libcublas.so",
"lib/libcublas.so.10",
"lib/libcublas.so.10.2.0.168",
"lib/libcublasLt.so",
"lib/libcublasLt.so.10",
"lib/libcublasLt.so.10.2.0.168",
"lib/libcudart.so",
"lib/libcudart.so.10.1",
"lib/libcudart.so.10.1.168",
"lib/libcufft.so",
"lib/libcufft.so.10",
"lib/libcufft.so.10.1.168",
"lib/libcufftw.so",
"lib/libcufftw.so.10",
"lib/libcufftw.so.10.1.168",
"lib/libcurand.so",
"lib/libcurand.so.10",
"lib/libcurand.so.10.1.168",
"lib/libcusolver.so",
"lib/libcusolver.so.10",
"lib/libcusolver.so.10.1.168",
"lib/libcusparse.so",
"lib/libcusparse.so.10",
"lib/libcusparse.so.10.1.168",
"lib/libdevice.10.bc",
"lib/libnppc.so",
"lib/libnppc.so.10",
"lib/libnppc.so.10.1.168",
"lib/libnppial.so",
"lib/libnppial.so.10",
"lib/libnppial.so.10.1.168",
"lib/libnppicc.so",
"lib/libnppicc.so.10",
"lib/libnppicc.so.10.1.168",
"lib/libnppicom.so",
"lib/libnppicom.so.10",
"lib/libnppicom.so.10.1.168",
"lib/libnppidei.so",
"lib/libnppidei.so.10",
"lib/libnppidei.so.10.1.168",
"lib/libnppif.so",
"lib/libnppif.so.10",
"lib/libnppif.so.10.1.168",
"lib/libnppig.so",
"lib/libnppig.so.10",
"lib/libnppig.so.10.1.168",
"lib/libnppim.so",
"lib/libnppim.so.10",
"lib/libnppim.so.10.1.168",
"lib/libnppist.so",
"lib/libnppist.so.10",
"lib/libnppist.so.10.1.168",
"lib/libnppisu.so",
"lib/libnppisu.so.10",
"lib/libnppisu.so.10.1.168",
"lib/libnppitc.so",
"lib/libnppitc.so.10",
"lib/libnppitc.so.10.1.168",
"lib/libnpps.so",
"lib/libnpps.so.10",
"lib/libnpps.so.10.1.168",
"lib/libnvToolsExt.so",
"lib/libnvToolsExt.so.1",
"lib/libnvToolsExt.so.1.0.0",
"lib/libnvblas.so",
"lib/libnvblas.so.10",
"lib/libnvblas.so.10.2.0.168",
"lib/libnvgraph.so",
"lib/libnvgraph.so.10",
"lib/libnvgraph.so.10.1.168",
"lib/libnvjpeg.so",
"lib/libnvjpeg.so.10",
"lib/libnvjpeg.so.10.1.168",
"lib/libnvrtc-builtins.so",
"lib/libnvrtc-builtins.so.10.1",
"lib/libnvrtc-builtins.so.10.1.168",
"lib/libnvrtc.so",
"lib/libnvrtc.so.10.1",
"lib/libnvrtc.so.10.1.168",
"lib/libnvvm.so",
"lib/libnvvm.so.3",
"lib/libnvvm.so.3.3.0"
]
.....
i.e. what you get is (keeping in mind most of those "files" above are just symlinks)
CUBLAS runtime
The CUDA runtime library
CUFFT runtime
CUrand runtime
CUsparse runtime
CUsolver runtime
NPP runtime
nvblas runtime
NVTX runtime
NVgraph runtime
NVjpeg runtime
NVRTC/NVVM runtime
The CUDNN package that conda installs is the redistributable binary distribution, which is identical to what NVIDIA distributes -- exactly two files, a header file and a library.
You would still require a supported NVIDIA driver installation to make the tensorflow which conda installs work.
If you want to actually compile and build CUDA code, you need to install a separate CUDA toolkit which contains all the development components that conda deliberately omits from their distribution.
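As a quick sanity check that the conda-installed TensorFlow can see the GPU through the host driver (a minimal sketch; tf.config.list_physical_devices is available in the TF 2.1.0 build shown above):
nvidia-smi    # the host driver is installed separately and must support CUDA 10.1
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"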

Tensorflow complains that no CUDA-capable device is detected

I'm trying to run some Tensorflow code, and I get what seems to be a common problem:
$ LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64 python -c "import tensorflow; tensorflow.Session()"
2019-02-06 20:36:15.903204: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-02-06 20:36:15.908809: E tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2019-02-06 20:36:15.908858: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:163] retrieving CUDA diagnostic information for host: tigris
2019-02-06 20:36:15.908868: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:170] hostname: tigris
2019-02-06 20:36:15.908942: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:194] libcuda reported version is: 390.77.0
2019-02-06 20:36:15.908985: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:198] kernel reported version is: 390.30.0
2019-02-06 20:36:15.909006: E tensorflow/stream_executor/cuda/cuda_diagnostics.cc:308] kernel version 390.30.0 does not match DSO version 390.77.0 -- cannot find working devices in this configuration
$
The key pieces of that error message seem to be:
[...] libcuda reported version is: 390.77.0
[...] kernel reported version is: 390.30.0
[...] kernel version 390.30.0 does not match DSO version 390.77.0 -- cannot find working devices in this configuration
How can I install compatible versions? Where is that libcuda version coming from?
Background
A few months ago, I tried installing Tensorflow with GPU support, but the versions either broke my display or wouldn't work with Tensorflow. Finally, I got it working by following a tutorial on how to install multiple versions of the CUDA libraries on the same machine. That worked at the time, but when I came back to the project after a few months, it had stopped working. I assume that some driver got upgraded during that time.
Investigation
The first thing I tried was to see what versions I have of the nvidia drivers and libcuda package.
$ dpkg --list|grep libcuda
ii libcuda1-390 390.30-0ubuntu1 amd64 NVIDIA CUDA runtime library
Looks like it's 390.30. Why does the error message say that libcuda reported 390.77?
$ dpkg --list|grep nvidia
ii libnvidia-container-tools 1.0.1-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.0.1-1 amd64 NVIDIA container runtime library
rc nvidia-384 384.130-0ubuntu0.16.04.1 amd64 NVIDIA binary driver - version 384.130
ii nvidia-390 390.30-0ubuntu1 amd64 NVIDIA binary driver - version 390.30
ii nvidia-390-dev 390.30-0ubuntu1 amd64 NVIDIA binary Xorg driver development files
rc nvidia-396 396.44-0ubuntu1 amd64 NVIDIA binary driver - version 396.44
ii nvidia-container-runtime 2.0.0+docker18.09.1-1 amd64 NVIDIA container runtime
ii nvidia-container-runtime-hook 1.4.0-1 amd64 NVIDIA container runtime hook
ii nvidia-docker2 2.0.3+docker18.09.1-1 all nvidia-docker CLI wrapper
ii nvidia-modprobe 390.30-0ubuntu1 amd64 Load the NVIDIA kernel driver and create device files
rc nvidia-opencl-icd-384 384.130-0ubuntu0.16.04.1 amd64 NVIDIA OpenCL ICD
ii nvidia-opencl-icd-390 390.30-0ubuntu1 amd64 NVIDIA OpenCL ICD
rc nvidia-opencl-icd-396 396.44-0ubuntu1 amd64 NVIDIA OpenCL ICD
ii nvidia-prime 0.8.8.2 all Tools to enable NVIDIA's Prime
ii nvidia-settings 396.44-0ubuntu1 amd64 Tool for configuring the NVIDIA graphics driver
Again, everything looks like it's 390.30. There were some packages that had version 390.77, but they were in the rc status. I guess I installed that version and later removed it, so the configuration files were left behind. I purged the configuration files with commands like this:
sudo apt-get remove --purge nvidia-kernel-common-390
Now, there are no packages at all with version 390.77.
$ dpkg --list|grep 390.77
$
I tried reinstalling CUDA, to see if it had been compiled with the wrong version.
$ sudo sh cuda_9.0.176_384.81_linux.run --silent --toolkit --toolkitpath=/usr/local/cuda-9.0 --override
That didn't make any difference.
Finally, I tried running nvidia-smi.
$ LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64 nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
$
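One direct way to compare the two sides of the mismatch at this point (a small sketch; paths assume the standard Ubuntu driver layout):
cat /proc/driver/nvidia/version                # version of the loaded kernel module
ls -l /usr/lib/x86_64-linux-gnu/libcuda.so*    # version(s) of the user-space driver library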
All of this is running on Ubuntu 18.04 with Python 3.6.7, and my graphics card is NVIDIA Corporation GM107M [GeForce GTX 960M] (rev a2).
I finally had the idea to look for any files with 390.77 in the name.
$ locate 390.77
/usr/lib/i386-linux-gnu/libcuda.so.390.77
/usr/lib/i386-linux-gnu/libnvcuvid.so.390.77
/usr/lib/i386-linux-gnu/libnvidia-compiler.so.390.77
/usr/lib/i386-linux-gnu/libnvidia-encode.so.390.77
/usr/lib/i386-linux-gnu/libnvidia-fatbinaryloader.so.390.77
/usr/lib/i386-linux-gnu/libnvidia-ml.so.390.77
/usr/lib/i386-linux-gnu/libnvidia-opencl.so.390.77
/usr/lib/i386-linux-gnu/libnvidia-ptxjitcompiler.so.390.77
/usr/lib/i386-linux-gnu/vdpau/libvdpau_nvidia.so.390.77
/usr/lib/x86_64-linux-gnu/libcuda.so.390.77
/usr/lib/x86_64-linux-gnu/libnvcuvid.so.390.77
/usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.390.77
/usr/lib/x86_64-linux-gnu/libnvidia-encode.so.390.77
/usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.390.77
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.390.77
/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.390.77
/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.390.77
/usr/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so.390.77
So there they are! A closer look shows that I must have installed the newer version at some point.
$ ls /usr/lib/i386-linux-gnu/libcuda* -l
lrwxrwxrwx 1 root root 12 Nov 8 13:58 /usr/lib/i386-linux-gnu/libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root 17 Nov 12 14:04 /usr/lib/i386-linux-gnu/libcuda.so.1 -> libcuda.so.390.77
-rw-r--r-- 1 root root 9179124 Jan 31 2018 /usr/lib/i386-linux-gnu/libcuda.so.390.30
-rw-r--r-- 1 root root 9179796 Jul 10 2018 /usr/lib/i386-linux-gnu/libcuda.so.390.77
Where did they come from?
$ dpkg -S /usr/lib/i386-linux-gnu/libcuda.so.390.30
libcuda1-390: /usr/lib/i386-linux-gnu/libcuda.so.390.30
$ dpkg -S /usr/lib/i386-linux-gnu/libcuda.so.390.77
dpkg-query: no path found matching pattern /usr/lib/i386-linux-gnu/libcuda.so.390.77
So the 390.77 no longer belongs to any package. Perhaps I installed the old version and had to force it to overwrite the links.
My plan is to delete the files, then reinstall the packages to set up the links to the correct version. So which packages will I need to reinstall?
$ locate 390.77|sed -e 's/390.77/390.30/'|xargs dpkg -S
Some of the files don't match anything, but the ones that do match are from these packages:
libcuda1-390
nvidia-opencl-icd-390
Crossing my fingers, I delete the version 390.77 files.
locate 390.77|sudo xargs rm
Then I reinstall the packages.
sudo apt-get install --reinstall libcuda1-390 nvidia-opencl-icd-390
Finally, it works!
$ LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64 python -c "import tensorflow; tensorflow.Session()"
2019-02-06 22:13:59.460822: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-02-06 22:13:59.665756: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-02-06 22:13:59.666205: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce GTX 960M major: 5 minor: 0 memoryClockRate(GHz): 1.176
pciBusID: 0000:01:00.0
totalMemory: 3.95GiB freeMemory: 3.81GiB
2019-02-06 22:13:59.666226: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-06 22:17:21.254445: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-06 22:17:21.254489: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-06 22:17:21.254496: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-06 22:17:21.290992: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3539 MB memory) -> physical GPU (device: 0, name: GeForce GTX 960M, pci bus id: 0000:01:00.0, compute capability: 5.0)
nvidia-smi also works now.
$ LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64 nvidia-smi
Wed Feb 6 22:19:24 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30 Driver Version: 390.30 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 960M Off | 00000000:01:00.0 Off | N/A |
| N/A 45C P8 N/A / N/A | 113MiB / 4046MiB | 6% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 3212 G /usr/lib/xorg/Xorg 113MiB |
+-----------------------------------------------------------------------------+
I rebooted, and the video drivers continued to work. Hurrah!

Cuda version for building xgboost

I'm trying to get xgboost compiled with GPU support. It seems my CUDA install is broken.
~$ cmake .. -DUSE_CUDA=ON
CMake Error at /usr/share/cmake-3.5/Modules/FindPackageHandleStandardArgs.cmake:148 (message):
Could NOT find CUDA: Found unsuitable version "7.5", but required is at
least "8.0" (found /usr)
Call Stack (most recent call first):
/usr/share/cmake-3.5/Modules/FindPackageHandleStandardArgs.cmake:386 (_FPHSA_FAILURE_MESSAGE)
/usr/share/cmake-3.5/Modules/FindCUDA.cmake:949 (find_package_handle_standard_args)
CMakeLists.txt:113 (find_package)
I originally had CUDA 7.5 installed, but afterwards installed CUDA 9.1. I tried to uninstall 7.5, but probably missed something. I ran the following commands to check my CUDA version.
~$ which nvcc
/usr/bin/nvcc
~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Tue_Aug_11_14:27:32_CDT_2015
Cuda compilation tools, release 7.5, V7.5.17
~$ cat /usr/local/cuda/version.txt
CUDA Version 9.1.85
~$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 390.30 Wed Jan 31 22:08:49 PST 2018
GCC version: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.6)
~$ nvidia-smi
Wed Feb 21 00:35:35 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30 Driver Version: 390.30 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:01:00.0 On | N/A |
| 25% 46C P2 56W / 250W | 487MiB / 11175MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
This question suggests clearing the CUDA files in /usr/bin, and I cleared the following files.
~$ ls /usr/local/cuda-9.1/bin
bin2c cuda-gdbserver nsight nvprof
computeprof cuda-install-samples-9.1.sh nsight_ee_plugins_manage.sh nvprune
crt cuda-memcheck nvcc nvvp
cudafe cuobjdump nvcc.profile ptxas
cudafe++ fatbinary nvdisasm uninstall_cuda_9.1.pl
cuda-gdb gpu-library-advisor nvlink
~$ cd /usr/bin
~$ ls /usr/local/cuda-9.1/bin | sudo xargs rm
rm: cannot remove 'computeprof': No such file or directory
rm: cannot remove 'crt': No such file or directory
rm: cannot remove 'gpu-library-advisor': No such file or directory
rm: cannot remove 'nsight': No such file or directory
rm: cannot remove 'nsight_ee_plugins_manage.sh': No such file or directory
rm: cannot remove 'nvcc.profile': No such file or directory
rm: cannot remove 'uninstall_cuda_9.1.pl': No such file or directory
Following the question, I added new paths to ~/.bashrc:
export PATH=/usr/local/cuda-9.1/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-9.1/lib64\
${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
After these changes, the system correctly references CUDA 9.1. The other diagnostic calls remain unchanged.
~$ which nvcc
/usr/local/cuda-9.1/bin/nvcc
~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Nov__3_21:07:56_CDT_2017
Cuda compilation tools, release 9.1, V9.1.85
However, running cmake .. -DUSE_CUDA=ON still fails, returning the same error. I tried restarting my computer, but it didn't help.
How can I get this to work?
Got it working...
I removed the xgboost directory, re-cloned it from GitHub, and then ran make again. Perhaps some residual files from the previous make configuration were clogging things up?
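For reference, a clean GPU build looks roughly like this (a sketch, not the exact commands used above; pointing CMake at the CUDA 9.1 toolkit explicitly avoids it picking up the stale nvcc from 7.5):
git clone --recursive https://github.com/dmlc/xgboost
cd xgboost
mkdir build && cd build
cmake .. -DUSE_CUDA=ON -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-9.1
make -j4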

How to install numpy linked against Intel MKL on an IBM POWER8 (ppc64le) machine?

I know Intel has already written a document about building numpy with MKL, but my machine's CPU is a POWER8 (IBM) (OS: CentOS). I downloaded from this link: https://registrationcenter.intel.com/en/products/postregistration/?sn=3VGW-J93Z886P&EmailID=mac16%40tsinghua.edu.cn&Sequence=2115993&dnld=t
But when I run install.sh:
[root@power8 intel_mkl]# sh install.sh
install.sh: line 50: [: -lt: unary operator expected
install.sh: line 53: [: -eq: unary operator expected
The IA-32 architecture host installation is no longer supported.
The product cannot be installed on this system.
Please refer to product documentation for more information.
Does Intel MKL support POWER8 machines? And how exactly do I build numpy linked against MKL on a POWER8 machine?