ERROR (theano.gpuarray): Could not initialize pygpu, support disabled - gpu

I am trying to configure Theano 0.9 to use the GPU, but I get the error below.
I am using Windows 10 with an NVIDIA GeForce 940M and CUDA 8. Previously my system worked fine with Theano 0.8 for GPU computation; I just updated Theano.
ERROR (theano.gpuarray): Could not initialize pygpu, support disabled
Traceback (most recent call last):
File "C:\Users\YL\Anaconda2\lib\site- packages\theano\gpuarray\__init__.py",
line 175, in <module>
use(config.device)
File "C:\Users\YL\Anaconda2\lib\site-packages\theano\gpuarray\__init__.py", line 162, in use
init_dev(device, preallocate=preallocate)
File "C:\Users\YL\Anaconda2\lib\site-packages\theano\gpuarray\__init__.py", line 65, in init_dev
sched=config.gpuarray.sched)
File "pygpu\gpuarray.pyx", line 614, in pygpu.gpuarray.init (pygpu/gpuarray.c:9415)
File "pygpu\gpuarray.pyx", line 566, in pygpu.gpuarray.pygpu_init (pygpu/gpuarray.c:9106)
File "pygpu\gpuarray.pyx", line 1021, in pygpu.gpuarray.GpuContext.__cinit__ (pygpu/gpuarray.c:13468)
GpuArrayException: Error loading library: -1
Without the GPU configuration, Theano works fine; with it, it produces the error above. I think I must be doing something wrong in the configuration. My .theanorc file is as follows:
[global]
device = cuda
floatX = float32
[cuda]
root = C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.5
[nvcc]
fastmath = True
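As a first sanity check (a sketch, not from the original post), you can try initializing libgpuarray directly, bypassing Theano; if this fails too, the problem is in the pygpu/libgpuarray installation rather than in .theanorc:
import pygpu.gpuarray
# Same entry point Theano calls for device=cuda (it is the function
# raising GpuArrayException in the traceback above).
ctx = pygpu.gpuarray.init("cuda0")
print(ctx.devname)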

I am getting the same (similar) error when I run Theano code. I'm using a laptop with two GPUs (Optimus technology). What fixed it for me was to run my Python code with the GPU enabled, like so:
optirun python2 my_code.py
Hope this helps.

Tensorflow TypeError: expected bytes, Descriptor found

I've been following this tutorial for recognising an object using machine learning:
https://www.youtube.com/watch?v=Rgpfk6eYxJA
I've followed all the instructions on what to install and how, including those in this related tutorial:
https://www.youtube.com/watch?v=RplXYjxgZbw
I tried both their versions and the newest available versions of the software, with the exception that I create the virtual environment like this:
conda create -n tensorflow1 pip python=3.6
Because the tensorflow module isn't yet compatible with python 3.7.
After I install all the packages needed, also described here:
https://github.com/EdjeElectronics/TensorFlow-Object-Detection-API-Tutorial-Train-Multiple-Objects-Windows-10
Under 2d. Set up new Anaconda virtual environment
and go through the code in the video, I run into an error when I run
python generate_tfrecord.py --csv_input=images\train_labels.csv --image_dir=images\train --output_path=train.record
which works in the video at 19:35.
The error is
2019-12-11 10:13:43.410540: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'cudart64_100.dll'; dlerror: cudart64_100.dll not found
Traceback (most recent call last):
File "generate_tfrecord.py", line 17, in <module>
import tensorflow as tf
File "C:\Anaconda\envs\tensorflow1\lib\site-packages\tensorflow\__init__.py", line 98, in <module>
from tensorflow_core import *
File "C:\Anaconda\envs\tensorflow1\lib\site-packages\tensorflow_core\__init__.py", line 40, in <module>
from tensorflow.python.tools import module_util as _module_util
File "<frozen importlib._bootstrap>", line 971, in _find_and_load
File "<frozen importlib._bootstrap>", line 947, in _find_and_load_unlocked
File "C:\Anaconda\envs\tensorflow1\lib\site-packages\tensorflow\__init__.py", line 50, in __getattr__
module = self._load()
File "C:\Anaconda\envs\tensorflow1\lib\site-packages\tensorflow\__init__.py", line 44, in _load
module = _importlib.import_module(self.__name__)
File "C:\Anaconda\envs\tensorflow1\lib\importlib\__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "C:\Anaconda\envs\tensorflow1\lib\site-packages\tensorflow_core\python\__init__.py", line 52, in <module>
from tensorflow.core.framework.graph_pb2 import *
File "C:\Anaconda\envs\tensorflow1\lib\site-packages\tensorflow_core\core\framework\graph_pb2.py", line 16, in <module>
from tensorflow.core.framework import node_def_pb2 as tensorflow_dot_core_dot_framework_dot_node__def__pb2
File "C:\Anaconda\envs\tensorflow1\lib\site-packages\tensorflow_core\core\framework\node_def_pb2.py", line 16, in <module>
from tensorflow.core.framework import attr_value_pb2 as tensorflow_dot_core_dot_framework_dot_attr__value__pb2
File "C:\Anaconda\envs\tensorflow1\lib\site-packages\tensorflow_core\core\framework\attr_value_pb2.py", line 16, in <module>
from tensorflow.core.framework import tensor_pb2 as tensorflow_dot_core_dot_framework_dot_tensor__pb2
File "C:\Anaconda\envs\tensorflow1\lib\site-packages\tensorflow_core\core\framework\tensor_pb2.py", line 16, in <module>
from tensorflow.core.framework import resource_handle_pb2 as tensorflow_dot_core_dot_framework_dot_resource__handle__pb2
File "C:\Anaconda\envs\tensorflow1\lib\site-packages\tensorflow_core\core\framework\resource_handle_pb2.py", line 16, in <module>
from tensorflow.core.framework import tensor_shape_pb2 as tensorflow_dot_core_dot_framework_dot_tensor__shape__pb2
File "C:\Anaconda\envs\tensorflow1\lib\site-packages\tensorflow_core\core\framework\tensor_shape_pb2.py", line 112, in <module>
'__module__' : 'tensorflow.core.framework.tensor_shape_pb2'
TypeError: expected bytes, Descriptor found
This is the same problem that appears in the Jupyter kernel when I run the imports shown in the video at 14:25.
How do I fix the
TypeError: expected bytes, Descriptor found
error?
And what about the
Could not load dynamic library 'cudart64_100.dll'; dlerror: cudart64_100.dll not found
that also appears?
I can also share this with you: in the second tutorial, the one just about installing the tensorflow-gpu library, after I create an account for cuDNN and download it as instructed, I only get a cudnn64_7.dll file in C:\cuda\bin, which is in my system PATH environment variable, just as are
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.1\bin
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.1\libnvvp and
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.1\extras\CUPTI\lib64,
as instructed in the tutorial. As you can see, I have version 10.1 of CUDA and cuDNN, and the paths are a bit different. The GPU driver is also up to date.
P.S. In the tensorflow installation tutorial, the test code doesn't work either.
This is all the information I think I have to offer.
I've been trying to solve this problem for 4-5 days at this point (and this is not the first video I've watched to get a .record file for an image recognition neural network),
and the solutions for this particular problem offered in TypeError: expected bytes, Descriptor found or anywhere else on Stack Overflow have not helped.
What should I do?
P.S. The tensorflow-gpu version I have is 2.0.0, and it might not be compatible with CUDA and cuDNN. That might be why I only have a cudnn64_7.dll file and not a cudart64_100.dll file. If no one has another solution, I'll just install tensorflow 1.5 and try the software again.
If someone has another solution however, by all means, post it. I'll post a reply if it works. I'll edit this if it doesn't.
I followed a different tutorial, but came across the same errors.
In case anyone is still wondering, I fixed it by updating the tensorflow version from the original 1.5 to 1.15:
pip install --ignore-installed --upgrade tensorflow-gpu==1.15.0
This is the official issue where I got the idea from.
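To confirm the upgraded install can actually see the GPU, a quick check (just a sketch, not part of the original answer) for TF 1.x is:
import tensorflow as tf
print(tf.__version__)                # should print 1.15.0
print(tf.test.is_gpu_available())    # True only if CUDA/cuDNN are set up correctly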
As for the second part,
Could not load dynamic library 'cudart64_100.dll'; dlerror: cudart64_100.dll not found
This is an issue with the CUDA drivers. In short, there's a compatibility issue between TensorFlow and your GPU setup. In most cases, don't worry too much, since TensorFlow will fall back to your CPU instead of the GPU for training a model. In case you really want to use the GPU (for better performance etc.), check whether it's supported. You can check a similarly asked question, or an official source.
Alternatively, since you've installed CUDA 10.1, as per the official documentation you'll need to upgrade to tensorflow 2.1.0 or above to make it work.
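If you do go the TF 2.1+ route, the corresponding check (again only a sketch) is:
import tensorflow as tf
print(tf.__version__)                            # 2.1.0 or above
print(tf.config.list_physical_devices('GPU'))    # non-empty list if the GPU is usable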
Personally, I had to opt for tensorflow 1.15 over 2.2.0 and install CUDA 9.0 to make everything run. However, I'm working on a laptop with a mobile 1050 GPU, and no matter what I tried, I couldn't get it to run otherwise.

NotFoundError : ; on tensorflow 1.5 object detection API, running smoothly on 1.4

I recently upgraded one of my small Ubuntu (16.04) servers from tensorflow-gpu 1.4 to tensorflow-gpu 1.5 for working with the object detection API. I have git cloned the latest version of the API, which is supposed to work with tensorflow 1.5.
CUDA/cuDNN and other tensorflow programs are up and running after the upgrade, and all test scripts in the object detection API run fine.
Despite this, when I attempt to run train.py it fails immediately with the following error:
File "/home/arvid/ownCloud/tensorflow/models/research/object_detection/train.py", line 167, in <module> tf.app.run()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 124, in run _sys.exit(main(argv))
File "/home/arvid/ownCloud/tensorflow/models/research/object_detection/train.py", line 107, in main overwrite=True)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/lib/io/file_io.py", line 385, in copy compat.as_bytes(oldpath), compat.as_bytes(newpath), overwrite, status)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__ c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.NotFoundError: ; No such file or directory
This error arises when some input file is missing, but the problem here is that no file is specified in the error.
Usually the missing file is shown between the comma and the semicolon, but in this error there is just a blank space.
I can reproduce the same error on my working server running tensorflow 1.4 by inserting a space between --train_dir= and the path:
--train_dir= {some_path}
But that is not the case here!
Additional info: when I run train.py the 'train' directory is created at the location I specify, so tensorflow seems to be able to identify paths etc.
Any input on how to debug this would be greatly appreciated!!
(Ok, I'm feeling a bit stupid right now...)
The solution was simple: the names of the flags for train.py changed with the update...
It used to be:
--pipeline_config={some_path}
But now it's:
--pipeline_config_path={some_path}
Still, a more informative error message would be useful...
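For reference, a full invocation with the renamed flag might look like this (the paths here are placeholders, not taken from the original post):
python train.py --logtostderr --train_dir=training/ --pipeline_config_path=training/my_model.config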
Remove the spaces between --train_dir= and {some_path}, and between --pipeline_config_path= and {some_path}.
It worked for me.

Why do I get an ImportError when I import TensorFlow?

I'm trying to install tensorflow and now I'm stuck with the following error:
ranj@ranj-Aspire-V3-772G:~$ python3 -c 'import tensorflow as tf; print(tf.__version__)'  # for Python 3
Traceback (most recent call last):
...
File "/usr/lib/python3.5/imp.py", line 242, in load_module
return load_dynamic(name, filename, file)
File "/usr/lib/python3.5/imp.py", line 342, in load_dynamic
return _load(spec)
ImportError: libnvidia-fatbinaryloader.so.375.39: cannot open shared object file: No such file or directory
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<string>", line 1, in <module>
...
File "/usr/lib/python3.5/imp.py", line 342, in load_dynamic
return _load(spec)
ImportError: libnvidia-fatbinaryloader.so.375.39: cannot open shared object file: No such file or directory
Failed to load the native TensorFlow runtime.
See https://www.tensorflow.org/install/install_sources#common_installation_problems
for some common reasons and solutions. Include the entire stack trace
above this error message when asking for help.
Can someone tell me how I can solve it?
I'm not sure what exactly caused the error you report, but it seems like an issue with CUDA and/or communication with the NVIDIA card in general. Also, I don't know why the installation of the graphics driver failed as you mention in the comments, but if you want GPU support, a working graphics driver is obviously essential. So either the driver you currently have installed already works fine, or you will have to find out why installing a new version of the driver fails.
You could proceed like this:
Make sure your graphics card fulfills the requirements (CUDA compute capability >= 3.0, check the compute capability of your card here).
Make sure your installation of cuDNN and CUDA Toolkit works fine. For this, you could follow the instructions here (point 6.2.2).
If this works fine, it might just be that Tensorflow cannot find the required CUDA libraries. Check this related Stackoverflow post: GPU tensorflow install issue
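One quick way to test point 3 (a sketch, using the library named in the traceback) is to try loading the shared object directly; if this raises OSError, the directory containing it is probably not on LD_LIBRARY_PATH:
import ctypes
# Raises OSError if the dynamic linker cannot find the library,
# which is the same situation TensorFlow runs into at import time.
ctypes.CDLL("libnvidia-fatbinaryloader.so.375.39")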
As a side note: the tutorial you linked in the comments seems to suggest that you have to build Tensorflow from source using bazel, which is in fact not always necessary. I would recommend taking a look at the official installation instructions; those are pretty comprehensive and cover all the details you need for the installation. So if all else fails, consider starting from scratch and following the official tutorial linked above.
I finally fixed it in GNOME (Ctrl+Alt+F3): I logged in with my account and then used:
"sudo init 3"
"sudo -i"
Then I went to the downloads dir and installed the NVIDIA driver: "sh NVIDIAxxx.run"
"reboot"
In GNOME I can now use python3 and import tensorflow.
The problem now is that I cannot log in to the OS in the normal manner; I can still log in to GNOME.

ValueError: Theano nvcc.flags support only parameter/value pairs without space between them

I tried to configure Theano to use the GPU on Windows 10 on my laptop with an NVIDIA GeForce 940M. I downloaded and installed VS2012 and CUDA 7.5 without any error message. I use Anaconda2 for Python, installed Theano using "pip install theano", and created a .theanorc file in my home directory. Everything seems fine. But when I import theano in the Anaconda prompt, the following error pops up:
ValueError: Theano nvcc.flags support only parameter/value pairs
without space between them. e.g.: '--machine 64' is not supported, but
'--machine=64' is supported. Please add the '=' symbol. nvcc.flags
value is '-LC:\Users\YL\Anaconda2\libs'
But the error message does not consistently show up. There is another error message:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\YL\Anaconda2\lib\site-packages\theano\__init__.py", line 42, in <module>
from theano.configdefaults import config
File "C:\Users\YL\Anaconda2\lib\site-packages\theano\configdefaults.py", line 43, in <module>
convert=floatX_convert,),
File "C:\Users\YL\Anaconda2\lib\site-packages\theano\configparser.py", line 270, in AddConfigVar
configparam.fullname)
AttributeError: ('This name is already taken', 'floatX')
For your information, here are the contents of my .theanorc file:
[global]
floatX = float32
device = gpu
[cuda]
root = C:\Program Files\NVIDIA Corporation\Installer2\CUDAToolkit_7.5.{57548CFE-7018-485B-A9DD-BC53E4140915}
[nvcc]
fastmath = True
flags = -LC:\Users\YL\Anaconda2\libs
compiler_bindir = C:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\bin
Such error messages also occur with Keras, so neither package can be used. But when I tried deleting the .theanorc file from my home directory, there were no more errors! It seems there is something wrong with .theanorc or something else involved. Does anybody know how to solve the problem?
I tried deleting the last two lines under [nvcc], and that worked for a Theano GPU test file. But in the Anaconda prompt it shows:
Using gpu device 0: GeForce 940M (CNMeM is disabled, cuDNN not available)
DEBUG: nvcc STDOUT mod.cu
Creating library C:/Users/YL/AppData/Local/Theano/compiledir_Windows-10-10.0.14393-Intel64_Family_6_Model_78_Stepping_3_GenuineIntel-2.7.12-64/tmp6vtxlj/97496c4d3cf9a06dc4082cc141f918d2.lib and object C:/Users/YL/AppData/Local/Theano/compiledir_Windows-10-10.0.14393-Intel64_Family_6_Model_78_Stepping_3_GenuineIntel-2.7.12-64/tmp6vtxlj/97496c4d3cf9a06dc4082cc141f918d2.exp

Argparse error with TensorFlow's cifar10.py

I get the following error when I run python cifar10.py:
argparse.ArgumentError: argument --batch_size: conflicting option string(s): --batch_size
Here's the full output of the run including a complete trace:
I tensorflow/stream_executor/dso_loader.cc:101] successfully opened CUDA library libcublas.so.7.0 locally
I tensorflow/stream_executor/dso_loader.cc:101] successfully opened CUDA library libcudnn.so.6.5 locally
I tensorflow/stream_executor/dso_loader.cc:101] successfully opened CUDA library libcufft.so.7.0 locally
I tensorflow/stream_executor/dso_loader.cc:101] successfully opened CUDA library libcuda.so locally
I tensorflow/stream_executor/dso_loader.cc:101] successfully opened CUDA library libcurand.so.7.0 locally
Traceback (most recent call last):
File "cifar10.py", line 54, in <module>
"""Number of images to process in a batch.""")
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/default/_flags.py", line 86, in DEFINE_integer
_define_helper(flag_name, default_value, docstring, int)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/default/_flags.py", line 60, in _define_helper
type=flagtype)
File "/usr/lib/python2.7/argparse.py", line 1297, in add_argument
return self._add_action(action)
File "/usr/lib/python2.7/argparse.py", line 1671, in _add_action
self._optionals._add_action(action)
File "/usr/lib/python2.7/argparse.py", line 1498, in _add_action
action = super(_ArgumentGroup, self)._add_action(action)
File "/usr/lib/python2.7/argparse.py", line 1311, in _add_action
self._check_conflict(action)
File "/usr/lib/python2.7/argparse.py", line 1449, in _check_conflict
conflict_handler(action, confl_optionals)
File "/usr/lib/python2.7/argparse.py", line 1456, in _handle_conflict_error
raise ArgumentError(action, message % conflict_string)
argparse.ArgumentError: argument --batch_size: conflicting option string(s): --batch_size
This error seems to come from the following line in cifar10.py: tf.app.flags.DEFINE_integer('batch_size', 128, """Number of images to process in a batch.""")
It seems like the argparse library thinks that I've already defined the option string --batch_size, but I haven't.
[Stack: Amazon g2.2xlarge spot instance, Python 2.7.6]
In the cifar10.py file:
import tensorflow as tf
from tensorflow.models.image.cifar10 import cifar10_input
FLAGS = tf.app.flags.FLAGS
# Basic model parameters.
tf.app.flags.DEFINE_integer('batch_size', 128,
"""Number of images to process in a batch.""")
....
The error is produced by this last statement, which, in the _flags.py file, defines an argparse argument with that name. Evidently at this point tf.app already has such an argument defined.
So we need to look further back at import tensorflow as tf to see how tf.app was created?
What's the Amazon g2.2xlarge? Could that be defining batch_size as well?
Looks like tf.app comes from
tensorflow/python/platform/app.py
which in turn gets it from something like
from tensorflow.python.platform.google._app import *
So if you are running this on some Google or Amazon platform that itself accepts a batch_size parameter, it could produce this error.
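To see that this really is a double-definition problem, here is a minimal sketch (assuming the argparse-based _flags.py implementation shown in the traceback) that triggers the same error simply by defining the flag twice, which is what happens if cifar10 ends up imported twice or a batch_size flag already exists:
import tensorflow as tf

# First definition succeeds.
tf.app.flags.DEFINE_integer('batch_size', 128, "Number of images to process in a batch.")
# Second definition hits argparse's conflict check and raises:
# argparse.ArgumentError: argument --batch_size: conflicting option string(s): --batch_size
tf.app.flags.DEFINE_integer('batch_size', 128, "Number of images to process in a batch.")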
Another question about cifar10 and the batch_size argument:
How to use "FLAGS" (command line switches) in TensorFlow?
Same error here:
Tensorflow ArgumentError Running CIFAR-10 example
The answer says to use cifar10_train.py or cifar10_eval.py, not cifar10.py.
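In other words, cifar10.py is meant to be imported rather than executed directly; the runnable entry points in the TensorFlow CIFAR-10 example are invoked like this (a sketch following the official tutorial, not output from the original post):
python cifar10_train.py
python cifar10_eval.py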