Cannot compile nms_cuda in mmdetection using Colaboratory

/usr/local/lib/python3.6/dist-packages/torch/include/ATen/core/TensorBody.h:262:30: note: declared here
DeprecatedTypeProperties & type() const {
^~~~
mmdet/ops/nms/src/nms_cuda.cpp:4:23: error: ‘AT_CHECK’ was not declared in this scope
#define CHECK_CUDA(x) AT_CHECK(x.type().is_cuda(), #x, " must be a CUDAtensor ")
^
mmdet/ops/nms/src/nms_cuda.cpp:4:23: note: in definition of macro ‘CHECK_CUDA’
#define CHECK_CUDA(x) AT_CHECK(x.type().is_cuda(), #x, " must be a CUDAtensor ")
^~~~~~~~
mmdet/ops/nms/src/nms_cuda.cpp:4:23: note: suggested alternative: ‘DCHECK’
#define CHECK_CUDA(x) AT_CHECK(x.type().is_cuda(), #x, " must be a CUDAtensor ")
^
mmdet/ops/nms/src/nms_cuda.cpp:4:23: note: in definition of macro ‘CHECK_CUDA’
#define CHECK_CUDA(x) AT_CHECK(x.type().is_cuda(), #x, " must be a CUDAtensor ")
^~~~~~~~
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
This is all the information I have:
sys.version 8.2
/usr/bin/gcc 7
gcc version 7.5
Is this normal?
What happened? Is it a CUDA error or a gcc error?
How can I solve it? I urgently need this to run an mmdetection program.

The solution, if you are using a branch earlier than v2.0.0, is:
conda install pytorch cudatoolkit==10.0 torchvision -c pytorch -y
Or, if your branch is v2.0.0:
conda install pytorch cudatoolkit torchvision -c pytorch -y
It took me 3 days to figure this out. Installing cudatoolkit==10.0 forces PyTorch to be downgraded to version 1.4, which makes the whole thing work. It's not at all obvious, and in my opinion it is a bug in the mmdetection installation guide.

AT_CHECK is no longer available in newer versions of PyTorch.
Solution:
Replace every AT_CHECK with TORCH_CHECK in the extension sources, then rebuild.
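If there are many files to touch, the replacement can be scripted. The following is only a sketch: the helper names replace_at_check and patch_tree are made up for illustration, and the list of file suffixes is an assumption about which files contain the macro.

```python
import re
from pathlib import Path

# Match AT_CHECK only as a whole identifier, so a name such as
# MY_AT_CHECK is left untouched.
_AT_CHECK = re.compile(r"\bAT_CHECK\b")


def replace_at_check(source):
    """Return the source text with every AT_CHECK renamed to TORCH_CHECK."""
    return _AT_CHECK.sub("TORCH_CHECK", source)


def patch_tree(root, suffixes=(".cpp", ".cu", ".h", ".cuh")):
    """Rewrite matching files under root in place; return the changed paths.

    The suffix list is an assumption; adjust it to the project layout.
    """
    changed = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8")
            patched = replace_at_check(text)
            if patched != text:
                path.write_text(patched, encoding="utf-8")
                changed.append(str(path))
    return changed
```

For the paths in the log above, something like patch_tree("mmdet/ops") before rerunning the build should be enough, assuming the build is started from the repository root.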

GPU error: SageMaker ml.p2.xlarge instance using tensorflow==2.3.0

I got the following error when trying to train my TensorFlow model on a SageMaker ml.p2.xlarge instance. I use tensorflow==2.3.0. I wonder whether this is caused by an incompatibility between the TensorFlow version and CUDA; the SageMaker ml.p2.xlarge instance seems to use CUDA 10.0.
GPU error:
2020-08-31 08:46:46.429756: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/openmpi/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-08-31 08:47:02.170819: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/openmpi/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-08-31 08:47:02.764874: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1753] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
This question is probably old, but it comes back to an issue you face right at the start, when choosing which framework versions to use.
The problem does not depend on the instance type you specified (which does have an NVIDIA GPU).
According to the official documentation, "Available Deep Learning Containers Images", as of 20/10/2022 prebuilt images for versions higher than 2.2 do not seem to be available:
Framework | Job Type | Horovod Options | CPU/GPU | Python Version Options | Example URL
--- | --- | --- | --- | --- | ---
TensorFlow 2.2 (Cuda 10.2) | training | Yes | GPU | 3.7 (py37) | 763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training:2.2.0-gpu-py37-cu102-ubuntu18.04
TensorFlow 2.2 | inference | No | GPU | 3.7 (py37) | 763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference:2.2.0-gpu-py37-cu102-ubuntu18.04
The Dockerfile used to build the container contains the instruction that installs the libraries your custom version is missing:
RUN apt-get update && apt-get install -y --no-install-recommends --allow-unauthenticated \
python3-dev \
python3-pip \
python3-setuptools \
ca-certificates \
cuda-command-line-tools-10-1 \
cuda-cudart-dev-10-1 \
cuda-cufft-dev-10-1 \
cuda-curand-dev-10-1 \
cuda-cusolver-dev-10-1 \
cuda-cusparse-dev-10-1 \
curl \
libcudnn7=7.6.2.24-1+cuda10.1 \
# TensorFlow doesn't require libnccl anymore but Open MPI still depends on it
libnccl2=2.4.7-1+cuda10.1 \
libgomp1 \
libnccl-dev=2.4.7-1+cuda10.1 \
....
You can then install the libraries required by your custom version with a requirements.txt file, or run the install command directly in the training script.
If there are no special project requirements, I recommend using the prebuilt SageMaker versions. Otherwise, build a Docker image from scratch instead of installing libraries this way.
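Before rebuilding anything, it can help to confirm which CUDA runtime libraries the container can actually load. The snippet below is a minimal sketch using only the standard library; the helper names can_load and missing_libraries are made up for illustration, and the library name is the one from the warning in the question.

```python
import ctypes


def can_load(library_name):
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        # Raised when the library is absent or cannot be opened.
        return False
    return True


def missing_libraries(names):
    """Return the subset of names that cannot be loaded, in the given order."""
    return [name for name in names if not can_load(name)]


if __name__ == "__main__":
    # libcudart.so.10.1 is the library TensorFlow 2.3.0 asked for in the log.
    print(missing_libraries(["libcudart.so.10.1"]))
```

An empty list means the loader found every library; any name printed is one that TensorFlow will also fail to open with the current LD_LIBRARY_PATH.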

Max pooling error in TensorFlow: Check failed: dnnPoolingCreateForward_F32(.<parameter list>.) == E_SUCCESS (-127 vs. 0)

I am learning TensorFlow from this blog:
http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/
The code I am running is:
https://github.com/dennybritz/cnn-text-classification-tf/blob/master/train.py
I installed TensorFlow from source in a virtual environment, on a CPU-only machine, using the following bazel build command: bazel build --config=mkl ...
Here is the exact error:
"2018-01-16 03:15:27.783040: F tensorflow/core/kernels/mkl_maxpooling_op.cc:157] Check failed: dnnPoolingCreateForward_F32( &prim_pooling_fwd, primAttr, algorithm, lt_user_input, params.kernel_size, params.kernel_stride, params.in_offset, dnnBorderZerosAsymm) == E_SUCCESS (-127 vs. 0)
Aborted
"
I traced the error to the line that calls sess.run. I believe it has something to do with mkl_maxpooling, since I built TensorFlow with the MKL optimizations for Intel CPUs.
These are the steps I followed:
Built TensorFlow 1.4 from source with MKL, as mentioned in the question
Cloned the git repo "https://github.com/dennybritz/cnn-text-classification-tf.git"
Ran "python train.py" from the "cnn-text-classification-tf" directory (created by git clone)
The code ran without any errors, so it seems that your TensorFlow was not built properly from source. Please confirm that there were no errors while building TensorFlow from source.

python tensorflow module dependency on glibc

I successfully built bazel and TensorFlow from source, but when I use the tensorflow module I get the following error:
./new_python/bin/python
>>> import tensorflow as tf
Error MSG: File "/home/niraj/Ansible/new_python/lib/python2.7/site-packages/tensorflow/python/pywrap_tensorflow.py", line 28, in <module> _pywrap_tensorflow = swig_import_helper()
ImportError: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by /home/niraj/Ansible/new_python/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow.so)
I am using a RHEL 6 machine. Any idea how to fix this?
I found two bug reports on GitHub regarding this very problem:
https://github.com/tensorflow/tensorflow/issues/110
https://github.com/bazelbuild/bazel/issues/760
My impression is that getting TensorFlow to work on RHEL 6 is difficult at best (some people in those two bug reports claim they got it to work, with some limitations) and, at least for now, possibly impossible.
There are solutions at least for Ubuntu 12.04 and CentOS 6.7. The 2nd answer (which mentions CentOS) should work on RHEL 6 as well.
Old/First answer:
According to the link I gathered from this answer, RHEL 6 ships with libc 2.12, not 2.14.
You would have to compile the tensorflow stuff again and link it to an existing libc 2.14 on your system. I'm not quite sure how you were able to compile it without already having libc 2.14 somewhere on your system.
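To see which C library version a machine actually has, Python's standard platform.libc_ver() can be used. The comparison helpers below are a sketch with made-up names (parse_version, glibc_satisfies); they compare versions numerically, because a plain string comparison would rank "2.9" above "2.14".

```python
import platform


def parse_version(text):
    """Turn a dotted version such as '2.14' into a tuple of ints, (2, 14)."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def glibc_satisfies(installed, required):
    """Return True if the installed version is at least the required one."""
    return parse_version(installed) >= parse_version(required)


if __name__ == "__main__":
    # libc_ver() returns a (name, version) pair; both are empty strings
    # when the C library cannot be identified.
    name, version = platform.libc_ver()
    if version:
        print(name, version, "meets 2.14:", glibc_satisfies(version, "2.14"))
    else:
        print("could not determine the C library version")
```

With the versions mentioned above, glibc_satisfies("2.12", "2.14") is False, which matches the ImportError in the question.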
What did the trick for me was updating glibc (in my case to version 2.17) with:
wget http://copr-be.cloud.fedoraproject.org/results/mosquito/myrepo-el6/epel-6-x86_64/glibc-2.17-55.fc20/glibc-2.17-55.el6.x86_64.rpm
wget http://copr-be.cloud.fedoraproject.org/results/mosquito/myrepo-el6/epel-6-x86_64/glibc-2.17-55.fc20/glibc-common-2.17-55.el6.x86_64.rpm
wget http://copr-be.cloud.fedoraproject.org/results/mosquito/myrepo-el6/epel-6-x86_64/glibc-2.17-55.fc20/glibc-devel-2.17-55.el6.x86_64.rpm
wget http://copr-be.cloud.fedoraproject.org/results/mosquito/myrepo-el6/epel-6-x86_64/glibc-2.17-55.fc20/glibc-headers-2.17-55.el6.x86_64.rpm
sudo rpm -Uvh glibc-2.17-55.el6.x86_64.rpm \
glibc-common-2.17-55.el6.x86_64.rpm \
glibc-devel-2.17-55.el6.x86_64.rpm \
glibc-headers-2.17-55.el6.x86_64.rpm --force --nodeps
This comes from the original answer, which I have linked.

Tensorflow compilation error with latest tensorflow source on MacOS

I am trying to build TensorFlow from source on Mac OS X Yosemite (10.10.5). After I run this command
bazel build --config=opt //tensorflow/tools/pip_package:build_pip_package
I get this error
C++ compilation of rule '//tensorflow/core:candidate_sampling_ops_op_lib' failed: cc_wrapper.sh failed: error executing command external/local_config_cc/cc_wrapper.sh -U_FORTIFY_SOURCE -fstack-protector -Wall -Wthread-safety -Wself-assign -fcolor-diagnostics -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG ... (remaining 95 argument(s) skipped): com.google.devtools.build.lib.shell.BadExitStatusException: Process exited with status 1.
tensorflow/core/ops/candidate_sampling_ops.cc:392:7: error: return type 'tensorflow::Status' must match previous return type 'const ::tensorflow::Status' when lambda expression has unspecified explicit return type
return Status::OK();
^
tensorflow/core/ops/candidate_sampling_ops.cc:376:17: error: no viable conversion from 'tensorflow::(lambda at tensorflow/core/ops/candidate_sampling_ops.cc:376:17)' to 'tensorflow::Status (*)(shape_inference::InferenceContext *)'
.SetShapeFn([](InferenceContext* c) {
What might I be doing wrong?
(Outdated, but still relevant for this version of TF.)
The latest version of TensorFlow does NOT compile or work on Mac OS X.
Here is my script for getting TensorFlow 1.0 working on macOS Sierra (i7, no GPU). I'm still working on getting SSE and the like to compile correctly, and on a later version of TensorFlow, but whatever. TensorFlow is not friendly with Macs, but DL4J is!
UPDATE:
You shouldn't need to upgrade from Yosemite. I was able to get r1.3 to compile with SSE and AVX! The 'latest release' at the time of writing has known issues; r1.3 is the latest stable build. I've included the script for a proper build below, and all the details are at http://www.josephmiguel.com/building-tensorflow-1-3-from-source-on-mac-osx-sierra-macbook-pro-i7-with-sse-and-avx/.
one time install
install anaconda3 pkg # manually download this and install the package
conda update conda
conda create -n dl python=3.6 anaconda
source activate dl
cd /
brew install bazel
pip install six numpy wheel
pip install --upgrade https://storage.googleapis.com/tensorflow/mac/cpu/protobuf-3.1.0-cp35-none-macosx_10_11_x86_64.whl
sudo -i
cd /
rm -rf tensorflow # if rerunning the script
cd /
git clone https://github.com/tensorflow/tensorflow
Step 1
cd /tensorflow
git checkout r1.3 -f
cd /
chmod -R 777 tensorflow
cd /tensorflow
./configure # accept all default settings
Step 2
# https://stackoverflow.com/questions/41293077/how-to-compile-tensorflow-with-sse4-2-and-avx-instructions
bazel build --config=opt --copt=-mavx --copt=-mavx2 --copt=-mfma //tensorflow/tools/pip_package:build_pip_package
Step 3
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
pip install /tmp/tensorflow_pkg/tensorflow-1.0.1-cp36-cp36m-macosx_10_7_x86_64.whl
Step 4
cd ~
ipython
Step 5
import tensorflow as tf
hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
print(sess.run(hello))
Step 6
pip uninstall /tmp/tensorflow_pkg/tensorflow-1.0.1-cp36-cp36m-macosx_10_7_x86_64.whl
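The wheel filename used in Steps 3 and 6 depends on the TensorFlow version and the Python build, so it may not match the file that build_pip_package actually produces. As a sketch (the helper names select_wheels and find_wheels are made up for illustration), the output directory can be listed to find the real name:

```python
from pathlib import Path


def select_wheels(filenames, package="tensorflow"):
    """Return the names that look like wheels of the given package, sorted."""
    prefix = package + "-"
    return sorted(
        name for name in filenames
        if name.startswith(prefix) and name.endswith(".whl")
    )


def find_wheels(package_dir, package="tensorflow"):
    """Return full paths of matching wheels; empty list if the directory is missing."""
    directory = Path(package_dir)
    if not directory.is_dir():
        return []
    names = select_wheels((entry.name for entry in directory.iterdir()), package)
    return [str(directory / name) for name in names]


if __name__ == "__main__":
    # /tmp/tensorflow_pkg is the output directory used in Step 3.
    for wheel in find_wheels("/tmp/tensorflow_pkg"):
        print(wheel)
```

Whatever path this prints is the one to pass to pip install.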

g++ error on import of Theano on Windows 7

I'm attempting to get set up with a proper g++ installation according to the Theano installation guide. I've previously had Theano working with the Python-only implementation. I'm using the bleeding-edge version of Theano from their git repo on Python 3.4. I've tried the TDM-GCC-64 method that Theano suggests as well as MinGW, and both result in exactly the same error (copied as readably as possible).
Problem occurred during compilation with the command line below:
C:\MinGW\bin\g++.exe -shared -g -march=skylake -mmmx -mno-3dnow -msse -msse2 -msse3
-mssse3 -mno-sse4a -mcx16 -msahf -mmovbe -maes -mno-sha -mpclmul -mpopcnt
-mabm -mno-lwp -mfma -mno-fma4 -mno-xop -mbmi -mbmi2 -mno-tbm -mavx
-mavx2 -msse4.2 -msse4.1 -mlzcnt -mrtm -mhle -mrdrnd -mf16c -mfsgsbase
-mrdseed -mprfchw -madx -mfxsr -mxsave -mxsaveopt -mno-avx512f
-mno-avx512er -mno-avx512cd -mno-avx512pf -mno-prefetchwt1 -mclflushopt
-mxsavec -mxsaves -mno-avx512dq -mno-avx512bw -mno-avx512vl
-mno-avx512ifma -mno-avx512vbmi -mno-clwb -mno-pcommit -mno-mwaitx
-mno-clzero -mno-pku --param l1-cache-size=32 --param
l1-cache-line-size=64 --param l2-cache-size=8192 -mtune=skylake
-DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION -m64 -DMS_WIN64
-IC:\Python34_64bit\lib\site-packages\numpy\core\include
-IC:\Python34_64bit\include -IC:\Python34_64bit\lib\site-packages\theano\gof
-o C:\Users\Jwely\AppData\Local\Theano\compiledir_Windows-7-6.1.7601-SP1-Intel64_Family_6_Model_94_Stepping_3_GenuineIntel-3.4.4-64\lazylinker_ext\lazylinker_ext.pyd
C:\Users\Jwely\AppData\Local\Theano\compiledir_Windows-7-6.1.7601-SP1-Intel64_Family_6_Model_94_Stepping_3_GenuineIntel-3.4.4-64\lazylinker_ext\mod.cpp
-LC:\Python34_64bit\libs -LC:\Python34_64bit -lpython34
In file included from c:\mingw\include\c++\6.1.0\math.h:36:0,
from C:\Python34_64bit\include/pyport.h:328,
from C:\Python34_64bit\include/Python.h:50,
from C:\Users\Jwely\AppData\Local\Theano\compiledir_Windows-7-6.1.7601-SP1-Intel64_Family_6_Model_94_Stepping_3_GenuineIntel-3.4.4-64\lazylinker_ext\mod.cpp:1:
c:\mingw\include\c++\6.1.0\cmath:1133:11: error: '::hypot' has not been declared
using ::hypot;
^~~~~
It may be worth noting that before it prints this error, it prints an entire file's worth of code; you can find the entire error output here.
I'm not sure what to try next, I've followed the directions twice, used a couple different installation methods for some dependencies, and made sure to clean up my system path between each attempt and reboot.
This worked for me:
Go to your user folder: C:/Users/[username]
Create .theanorc file if it doesn't already exist
Make sure it includes the lines:
[gcc]
cxxflags = -D_hypot=hypot
"Error: '::hypot' has not been declared" in cmath while trying to embed Python
Error building Boost 1.49.0 with GCC 4.7.0
My solution was to comment out every occurrence of the macro
#define hypot _hypot
in the pyconfig.h file.
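If you prefer not to edit the header by hand, the same change can be sketched in Python. The function name comment_out_hypot is made up for illustration; it works on the header text only, so reading and writing pyconfig.h (and keeping a backup of it) is left to you.

```python
import re

# Matches an active "#define hypot _hypot" line, allowing leading
# indentation and spaces after the '#'.
_HYPOT_DEFINE = re.compile(
    r"^([ \t]*)(#[ \t]*define[ \t]+hypot[ \t]+_hypot\b.*)$",
    re.MULTILINE,
)


def comment_out_hypot(header_text):
    """Return the header text with each active hypot define commented out."""
    return _HYPOT_DEFINE.sub(r"\1// \2", header_text)
```

Lines that are already commented out do not match the pattern, so running the function twice leaves the text unchanged.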
This worked for me:
Go to System Properties / Advanced system settings
Add your MinGW installation path to the system path. If it is already there and looks something like C:\{your MinGW installation}\bin
change it to C:\{your MinGW installation}
The answers above are probably a better, more permanent solution. For a quick fix, the following worked for me:
import theano
theano.config.gcc.cxxflags = "-D_hypot=hypot"
...with Windows 10, Anaconda 4.4, Python 2.7, Theano v0.10.0.dev1, m2w64-toolchain v5.3.0
If you can't create a file named .theanorc, you can create it from the console instead. First open cmd in C:/Users/[username], then start python and paste the code below:
import os
with open(os.path.join(os.environ["USERPROFILE"], ".theanorc"), "w") as f:
    f.write("[gcc]\ncxxflags = -D_hypot=hypot")
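Note that opening the file with "w" overwrites any settings already in .theanorc. A gentler variant, sketched here with the standard configparser module, keeps existing sections and only sets the gcc flag. The function name with_hypot_flag is made up for illustration, and an existing cxxflags value is replaced rather than extended.

```python
import configparser
import io


def with_hypot_flag(existing_text=""):
    """Return .theanorc text with [gcc] cxxflags set, keeping other settings."""
    config = configparser.ConfigParser(interpolation=None)
    # Keep option names such as floatX exactly as written.
    config.optionxform = str
    config.read_string(existing_text)
    if not config.has_section("gcc"):
        config.add_section("gcc")
    # This replaces any cxxflags value that was already present.
    config.set("gcc", "cxxflags", "-D_hypot=hypot")
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()
```

To use it, read the current contents of os.path.join(os.environ["USERPROFILE"], ".theanorc") if the file exists, pass them to with_hypot_flag, and write the result back to the same path.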
First, uninstall all Theano versions.
Then:
pip install pydot-ng
conda install mingw libpython
pip install git+https://github.com/Theano/Theano.git#egg=Theano