GPU error: SageMaker ml.p2.xlarge instance using tensorflow==2.3.0

I got the following error when trying to train my TensorFlow model on a SageMaker ml.p2.xlarge instance. I use tensorflow==2.3.0. I wonder whether this is caused by an incompatibility between the TensorFlow version and CUDA: the SageMaker ml.p2.xlarge instance seems to use CUDA 10.0.
GPU error:
2020-08-31 08:46:46.429756: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/openmpi/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-08-31 08:47:02.170819: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/openmpi/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-08-31 08:47:02.764874: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1753] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...

This question is probably old, but it comes down to an issue that arises as soon as you choose which framework versions to use.
The problem does not depend on the instance type you specified (which has an NVIDIA GPU).
According to the official documentation, "Available Deep Learning Containers Images", as of 20/10/2022 precompiled versions higher than 2.2 do not appear to be available:
| Framework | Job Type | Horovod Options | CPU/GPU | Python Version Options | Example URL |
|---|---|---|---|---|---|
| TensorFlow 2.2 (Cuda 10.2) | training | Yes | GPU | 3.7 (py37) | 763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training:2.2.0-gpu-py37-cu102-ubuntu18.04 |
| TensorFlow 2.2 | inference | No | GPU | 3.7 (py37) | 763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference:2.2.0-gpu-py37-cu102-ubuntu18.04 |
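The two example URLs share a visible pattern (account, region, repository, then a tag built from version, device, Python tag, CUDA tag and OS). As a minimal sketch, the hypothetical helper dlc_image_uri below reassembles them from those parts. The function name and its defaults are assumptions made for illustration, and the pattern is only inferred from the two rows listed here; other regions or versions may differ, so check the documentation before relying on a generated URI.

```python
def dlc_image_uri(job_type, version, device, python_tag, cuda_tag,
                  os_tag="ubuntu18.04", region="us-east-1",
                  account="763104351884"):
    """Assemble an image URI from its parts (hypothetical helper, not an AWS API)."""
    repository = f"tensorflow-{job_type}"
    # The tag is the dash-joined sequence seen in the example URLs.
    tag = "-".join([version, device, python_tag, cuda_tag, os_tag])
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


training_uri = dlc_image_uri("training", "2.2.0", "gpu", "py37", "cu102")
inference_uri = dlc_image_uri("inference", "2.2.0", "gpu", "py37", "cu102")
print(training_uri)
print(inference_uri)
```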
The Dockerfile used to build the container includes the instruction that installs the libraries your custom version is missing:
RUN apt-get update && apt-get install -y --no-install-recommends --allow-unauthenticated \
python3-dev \
python3-pip \
python3-setuptools \
ca-certificates \
cuda-command-line-tools-10-1 \
cuda-cudart-dev-10-1 \
cuda-cufft-dev-10-1 \
cuda-curand-dev-10-1 \
cuda-cusolver-dev-10-1 \
cuda-cusparse-dev-10-1 \
curl \
libcudnn7=7.6.2.24-1+cuda10.1 \
# TensorFlow doesn't require libnccl anymore but Open MPI still depends on it
libnccl2=2.4.7-1+cuda10.1 \
libgomp1 \
libnccl-dev=2.4.7-1+cuda10.1 \
....
You can then install the libraries required by your custom version either through a requirements.txt file or by running the install command directly in the training script.
If there are no special project requirements, I recommend using the precompiled SageMaker versions. Otherwise, build a Docker image from scratch instead of installing libraries this way.
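Before changing versions or images, it can help to confirm exactly which shared libraries TensorFlow could not load. The sketch below is only an illustration: missing_libraries and can_load are hypothetical helper names, and the sample text is copied from the warnings in the question. It extracts the library names from the log and asks the dynamic loader whether each one can be opened.

```python
import ctypes
import re

# Matches the library name in TensorFlow's dso_loader warning.
MISSING_LIB = re.compile(r"Could not load dynamic library '([^']+)'")


def missing_libraries(log_text):
    """Return the distinct library names reported as not loadable, in first-seen order."""
    seen = []
    for name in MISSING_LIB.findall(log_text):
        if name not in seen:
            seen.append(name)
    return seen


def can_load(library_name):
    """Return True if the dynamic loader can open the library by name."""
    try:
        ctypes.CDLL(library_name)
        return True
    except OSError:
        return False


sample_log = (
    "W tensorflow/stream_executor/platform/default/dso_loader.cc:59] "
    "Could not load dynamic library 'libcudart.so.10.1'; dlerror: "
    "libcudart.so.10.1: cannot open shared object file: No such file or directory\n"
    "W tensorflow/stream_executor/platform/default/dso_loader.cc:59] "
    "Could not load dynamic library 'libcudart.so.10.1'; dlerror: "
    "libcudart.so.10.1: cannot open shared object file: No such file or directory\n"
)

for lib in missing_libraries(sample_log):
    print(lib, "loadable" if can_load(lib) else "NOT loadable")
```

Running this inside the training container shows whether the CUDA runtime that the installed TensorFlow build asks for is actually present on the loader path.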

Related

Could not load dynamic library 'libnvinfer.so.7'

I know that this question has been asked a lot, but none of the suggestions seem to work, probably since my setup is somewhat different:
Ubuntu 22.04
python 3.10.8
tensorflow 2.11.0
cudatoolkit 11.2.2
cudnn 8.1.0.77
nvidia-tensorrt 8.4.3.1
nvidia-pyindex 1.0.9
Having created a conda environment 'tf', in the directory /home/dan/anaconda3/envs/tf/lib/python3.10/site-packages/tensorrt I have
libnvinfer_builder_resource.so.8.4.3
libnvinfer_plugin.so.8
libnvinfer.so.8
libnvonnxparser.so.8
libnvparsers.so.8
tensorrt.so
When running python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))" I get
tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7';
dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory;
LD_LIBRARY_PATH: :/home/dan/anaconda3/envs/tf/lib
tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7';
dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory;
LD_LIBRARY_PATH: :/home/dan/anaconda3/envs/tf/lib
tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
I'm guessing I should downgrade nvidia-tensorrt, but nothing I've tried seems to work. Any advice would be much appreciated.
Solution: follow the steps listed here https://github.com/tensorflow/tensorflow/issues/57679#issuecomment-1249197802.
Add the following to ~/.bashrc (for the conda envs as described in my scenario):
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/dan/anaconda3/lib/
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/dan/anaconda3/lib/python3.8/site-packages/tensorrt/
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/dan/anaconda3/envs/tf/lib
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/dan/anaconda3/envs/tf/lib/python3.10/site-packages/tensorrt/
For me, creating symbolic links from the libnvinfer version 7 names to the version 8 libraries worked:
# the following path will be different for you, depending on your install method
$ cd env/lib/python3.10/site-packages/tensorrt
# create symbolic links
$ ln -s libnvinfer_plugin.so.8 libnvinfer_plugin.so.7
$ ln -s libnvinfer.so.8 libnvinfer.so.7
# add tensorrt to library path
$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:~/env/lib/python3.10/site-packages/tensorrt/
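The same workaround can be scripted so that it is repeatable and does not overwrite anything. This is a minimal sketch, assuming the version 8 libraries sit in a single tensorrt directory as in the listing from the question; link_v7_to_v8 is a hypothetical helper name, and the directory you pass in depends on your own install.

```python
from pathlib import Path

# Name TensorFlow asks for -> file that is actually installed.
LINKS = {
    "libnvinfer.so.7": "libnvinfer.so.8",
    "libnvinfer_plugin.so.7": "libnvinfer_plugin.so.8",
}


def link_v7_to_v8(tensorrt_dir):
    """Create relative symlinks inside tensorrt_dir; return the link names created."""
    tensorrt_dir = Path(tensorrt_dir)
    created = []
    for link_name, target_name in LINKS.items():
        link = tensorrt_dir / link_name
        target = tensorrt_dir / target_name
        if not target.exists():
            # Nothing to point at, so skip rather than create a dangling link.
            continue
        if link.exists() or link.is_symlink():
            # Leave existing files or links untouched.
            continue
        # Relative link, equivalent to `ln -s libnvinfer.so.8 libnvinfer.so.7`.
        link.symlink_to(target_name)
        created.append(link_name)
    return created


# Example call (adjust the path to your environment):
# link_v7_to_v8(Path.home() / "env/lib/python3.10/site-packages/tensorrt")
```

The tensorrt directory still has to be on LD_LIBRARY_PATH for TensorFlow to find the links. Note that this only satisfies the loader's file name lookup; it does not make TensorRT 8 binary-compatible with a build that expects version 7.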

Location of base directory of OpenMPI for cmake

I am trying to compile Trilinos with MPI capabilities. But to specify the cmake command, I also need to specify the MPI base directory:
cmake \
-DTPL_ENABLE_MPI=ON \
-DMPI_BASE_DIR:FILEPATH="" \
-DTrilinos_ENABLE_PyTrilinos:BOOL=ON \
-DTrilinos_ENABLE_ALL_PACKAGES=ON \
-DTrilinos_ENABLE_TESTS:BOOL=ON \
-DBUILD_SHARED_LIBS:BOOL=ON \
-DCMAKE_INSTALL_PREFIX:STRING="$HOME/trilinos-install" \
$SOURCE_DIR
However, I am unable to find any base directory even though MPI is installed on my machine. When I enter commands like mpirun --version, I get:
mpirun (Open MPI) 2.1.1
or ompi_info:
Package: Open MPI buildd@lcy01-amd64-009 Distribution
Open MPI: 2.1.1
Open MPI repo revision: v2.1.0-100-ga2fdb5b
Open MPI release date: May 10, 2017
Open RTE: 2.1.1
...
I am running Ubuntu 18.04 LTS on WSL if that is useful info.
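The command "which mpich" (or "which mpirun" for an Open MPI install such as yours) shows where the MPI executables are installed; the base directory is the parent of that bin directory.
Another way is to use mpic++ as the compiler in CMake.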
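As a rough sketch of the first suggestion, the base directory can be derived from wherever the MPI launcher is found on PATH. The helper names base_dir_from_binary and mpi_base_dir are hypothetical, and the sketch assumes the usual prefix/bin/tool layout, which may not hold for every installation.

```python
import shutil
from pathlib import Path


def base_dir_from_binary(binary_path):
    """Given <prefix>/bin/<tool>, return <prefix>."""
    return Path(binary_path).parent.parent


def mpi_base_dir(tool="mpirun"):
    """Locate `tool` on PATH and return its install prefix, or None if it is not found."""
    found = shutil.which(tool)
    if found is None:
        return None
    # Follow symlinks (for example distribution alternatives) to the real binary.
    return base_dir_from_binary(Path(found).resolve())


print(mpi_base_dir("mpirun"))
```

The printed path is what would be passed as -DMPI_BASE_DIR. With a distribution package the result is likely to be a system prefix such as /usr rather than a dedicated Open MPI directory.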

How to cross-compile TensorFlow Lite for arm64 (Linux) using Bazel

I'm trying to build TensorFlow Lite for 'arm64-v8a' (Linux) on an amd64 Linux host. I followed the guide to build the library libtensorflow-lite.a, but I found that there remains a long tail of TensorFlow ops that are not yet natively supported by TensorFlow Lite, so I want to use Select TensorFlow operators in TensorFlow Lite. According to the guide I should recompile the library. I used the command bazel build -c opt //tensorflow/lite:libtensorflowlite.so --cpu=arm64-v8a and got the following error:
/home/wang/.cache/bazel/_bazel_wang/c72f4772665ac4cb0690414b07635968/external/local_config_cc/BUILD:45:1: in cc_toolchain_suite rule @local_config_cc//:toolchain: cc_toolchain_suite '@local_config_cc//:toolchain' does not contain a toolchain for cpu 'arm64-v8a'
I'm new to Bazel and cannot find a detailed walkthrough. What should I do?

The TensorFlow library wasn't compiled to use AVX - AVX2

I'm new to Tensorflow.
I am using a 64 bit version of Windows 10 and I would like to install Tensorflow for the CPU.
I don't remember the exact steps that I followed to install it, however when I checked for the installation using:
import tensorflow as tf
hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
print(sess.run(hello))
I have the following output:
2017-10-18 09:56:21.656601: W C:\tf_jenkins\home\workspace\rel-win\M\windows\PY\36\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-10-18 09:56:21.656984: W C:\tf_jenkins\home\workspace\rel-win\M\windows\PY\36\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
b'Hello, TensorFlow!'
I am running python in Sublime Text 3 using the package SublimeREPL.
I searched for these warnings and found out that they mean TensorFlow was built without these instructions, which could improve CPU performance. I also found code to hide these warnings, but I actually want to use these instructions.
The code that I found that enables this is:
bazel build -c opt --copt=-mavx --copt=-mavx2 --copt=-msse4.2 --copt=-msse4.1 --copt=-msse3 --copt=-mfma -k //tensorflow/tools/pip_package:build_pip_package
but I got this output:
ERROR: Skipping '//tensorflow/tools/pip_package:build_pip_package': no such package 'tensorflow/tools/pip_package': BUILD file not found on package path.
WARNING: Target pattern parsing failed. Continuing anyway.
INFO: Found 0 targets...
ERROR: command succeeded, but there were errors parsing the target pattern.
INFO: Elapsed time: 8,147s, Critical Path: 0,02s
How can I solve this problem?
Lastly, I don't understand what pip, wheel and bazel are, so I need step-by-step instructions.
Thank you a lot!
If you want to download the TensorFlow source and compile and install it yourself, use this link. If you want to download binaries, use this link.
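The "BUILD file not found on package path" error typically means bazel was run from a directory that is not the root of a TensorFlow source checkout, so the package tensorflow/tools/pip_package does not exist relative to it. The sketch below checks that before you start a build; package_dir and has_build_file are hypothetical helper names introduced here, not part of Bazel.

```python
from pathlib import Path


def package_dir(label):
    """Turn a label like //a/b:target into its package path a/b."""
    if not label.startswith("//"):
        raise ValueError("expected an absolute label starting with //")
    return label[2:].split(":", 1)[0]


def has_build_file(workspace_root, label):
    """Return True if the label's package directory contains a BUILD or BUILD.bazel file."""
    pkg = Path(workspace_root) / package_dir(label)
    return any((pkg / name).is_file() for name in ("BUILD", "BUILD.bazel"))


label = "//tensorflow/tools/pip_package:build_pip_package"
# Check the current directory, i.e. where the bazel command would be run.
print(package_dir(label), has_build_file(Path.cwd(), label))
```

If this reports False, change into the directory of the cloned TensorFlow sources before running the bazel build command.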

syntaxnet ./configure error

I am trying to use SyntaxNet and have finished most of the setup. I upgraded bazel to version 0.43 to avoid errors (Ubuntu 16.04, Anaconda Python 2.7).
However, I am having trouble with the ./configure step. I am following the official instructions on the TensorFlow GitHub.
git clone --recursive https://github.com/tensorflow/models.git
cd models/syntaxnet/tensorflow
./configure
cd ..
bazel test syntaxnet/... util/utf8/...
# On Mac, run the following:
bazel test --linkopt=-headerpad_max_install_names \
syntaxnet/... util/utf8/...
The following logs should help you understand what is going on on my machine. Thanks for any advice.
Please specify the location of python. [Default is /home/ryan/anaconda2/bin/python]:
Do you wish to build TensorFlow with Google Cloud Platform support? [y/N] n
No Google Cloud Platform support will be enabled for TensorFlow
Do you wish to build TensorFlow with Hadoop File System support? [y/N] n
No Hadoop File System support will be enabled for TensorFlow
Found possible Python library paths:
/home/ryan
/home/ryan/pynaoqi-python2.7
/home/ryan/anaconda2/lib/python2.7/site-packages
Please input the desired Python library path to use. Default is [/home/ryan]
/home/ryan/anaconda2/lib/python2.7/site-packages
Do you wish to build TensorFlow with GPU support? [y/N] y
GPU support will be enabled for TensorFlow
Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]:
Please specify the Cuda SDK version you want to use, e.g. 7.0. [Leave empty to use system default]: 8.0
Please specify the location where CUDA 8.0 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:
Please specify the Cudnn version you want to use. [Leave empty to use system default]: 5.0
Please specify the location where cuDNN 5.0 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:
Invalid path to cuDNN toolkit. Neither of the following two files can be found:
/usr/local/cuda-8.0/lib64/libcudnn.so.5.0
/usr/local/cuda-8.0/libcudnn.so.5.0
.5.0
Please specify the Cudnn version you want to use. [Leave empty to use system default]:
Please specify the location where cuDNN library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:
libcudnn.so resolves to libcudnn.5
Please specify a list of comma-separated Cuda compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size.
[Default is: "3.5,5.2"]:
INFO: Options provided by the client:
Inherited 'common' options: --isatty=1 --terminal_columns=120
INFO: Reading options for 'clean' from /home/ryan/git_ryan/models/syntaxnet/tensorflow/tools/bazel.rc:
Inherited 'build' options: --force_python=py2 --host_force_python=py2 --python2_path=/home/ryan/anaconda2/bin/python --define=use_fast_cpp_protos=true --define=allow_oversize_protos=true --define PYTHON_BIN_PATH=/home/ryan/anaconda2/bin/python --spawn_strategy=standalone --genrule_strategy=standalone
INFO: Reading options for 'clean' from /etc/bazel.bazelrc:
Inherited 'build' options: --action_env=PATH --action_env=LD_LIBRARY_PATH --action_env=TMPDIR --test_env=PATH --test_env=LD_LIBRARY_PATH
Unrecognized option: --action_env=PATH
ERROR: /home/ryan/git_ryan/models/syntaxnet/tensorflow/tensorflow/tensorflow.bzl:568:26: Traceback (most recent call last):
File "/home/ryan/git_ryan/models/syntaxnet/tensorflow/tensorflow/tensorflow.bzl", line 562
rule(attrs = {"srcs": attr.label_list..."), <3 more arguments>)}, <2 more arguments>)
File "/home/ryan/git_ryan/models/syntaxnet/tensorflow/tensorflow/tensorflow.bzl", line 568, in rule
attr.label_list(cfg = "data", allow_files = True)
expected ConfigurationTransition or NoneType for 'cfg' while calling label_list but got string instead: data.
ERROR: com.google.devtools.build.lib.packages.BuildFileContainsErrorsException: error loading package '': Extension file 'tensorflow/tensorflow.bzl' has errors.
Configuration finished
I think your Bazel version is too new for SyntaxNet. Try bazel-0.3.1.