Location of base directory of OpenMPI for cmake - cmake

I am trying to compile Trilinos with MPI capabilities. To build the cmake command, I also need to specify the MPI base directory:
cmake \
-DTPL_ENABLE_MPI=ON \
-DMPI_BASE_DIR:FILEPATH="" \
-DTrilinos_ENABLE_PyTrilinos:BOOL=ON \
-DTrilinos_ENABLE_ALL_PACKAGES=ON \
-DTrilinos_ENABLE_TESTS:BOOL=ON \
-DBUILD_SHARED_LIBS:BOOL=ON \
-DCMAKE_INSTALL_PREFIX:STRING="$HOME/trilinos-install" \
$SOURCE_DIR
However, I am unable to find any base directory even though MPI is installed on my machine. When I enter commands like mpirun --version, I get:
mpirun (Open MPI) 2.1.1
or ompi_info:
Package: Open MPI buildd@lcy01-amd64-009 Distribution
Open MPI: 2.1.1
Open MPI repo revision: v2.1.0-100-ga2fdb5b
Open MPI release date: May 10, 2017
Open RTE: 2.1.1
...
I am running Ubuntu 18.04 LTS on WSL if that is useful info.

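The command "which mpirun" (or "which mpicc") shows where the MPI executables are installed; the base directory is usually the parent of that bin directory. Note that "which mpich" only helps if MPICH, a different MPI implementation, is installed.
Another option is to set mpic++ as the compiler in CMake.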
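One way to turn the location of the MPI launcher into a base directory is sketched below. It assumes the conventional <base>/bin/mpirun layout; the function name mpi_base_dir is made up for this example, and whether Trilinos accepts the resulting directory on a given system is not something the original thread confirms.

```shell
# Derive an MPI base directory from the path of an MPI launcher.
# Assumption: the launcher lives in <base>/bin, so the base is two levels up.
# With no argument, the mpirun found on PATH is used.
mpi_base_dir() (
  exe=${1:-$(command -v mpirun)} || exit 1
  [ -n "$exe" ] || exit 1
  dirname "$(dirname "$exe")"
)

# Possible use with the cmake call from the question (not run here):
#   cmake -DTPL_ENABLE_MPI=ON \
#         -DMPI_BASE_DIR:FILEPATH="$(mpi_base_dir)" \
#         $SOURCE_DIR
```

If mpirun is a symlink (for example one managed by the distribution's alternatives system), the directory of the link is used, not the directory of its target.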

Related

Setting up TensorFlow with OpenCV, SciPy and scikit-learn on a MacBook Pro M1

I think I have read most of the guides on setting up tensorflow, tensorflow-hub and object detection on a Mac M1 running Big Sur v11.6. After more than 2 weeks I managed to work through most of the errors, but I am stuck on the OpenCV setup. I tried to compile it from source, but it seems it can't find the modules from its core package, so make keeps failing after a successful cmake configure. It fails at different stages, complaining about different libraries even though they are there, and it has never got past 31%, even after re-running cmake several times and deleting the build folder or the CMake cache file. So I am not sure what to do to get make to succeed.
I git cloned and unzipped the opencv-4.5.0 and opencv_contrib-4.5.0 in my miniforge3 directory. Then I created a folder "build" in my opencv-4.5.0 folder and the cmake command I use in it is (my miniforge conda environment is called silicon and made sure I am using arch arm64 in bash environment):
cmake -DCMAKE_SYSTEM_PROCESSOR=arm64 -DCMAKE_OSX_ARCHITECTURES=arm64 -DWITH_OPENJPEG=OFF -DWITH_IPP=OFF -D CMAKE_BUILD_TYPE=RELEASE -D CMAKE_INSTALL_PREFIX=/usr/local -D OPENCV_EXTRA_MODULES_PATH=/Users/adi/miniforge3/opencv_contrib-4.5.0/modules -D PYTHON3_EXECUTABLE=/Users/adi/miniforge3/envs/silicon/bin/python3.8 -D BUILD_opencv_python2=OFF -D BUILD_opencv_python3=ON -D INSTALL_PYTHON_EXAMPLES=ON -D INSTALL_C_EXAMPLES=OFF -D OPENCV_ENABLE_NONFREE=ON -D BUILD_EXAMPLES=ON /Users/adi/miniforge3/opencv-4.5.0
It fails like this:
[ 20%] Linking CXX shared library ../../lib/libopencv_core.dylib
[ 20%] Built target opencv_core
make: *** [all] Error 2
In other attempts it initially asked for calib3d or dnn, even though those libraries are there in the main opencv-4.5.0 folder.
The other way I tried to install OpenCV is with conda:
conda install opencv
But then when I test with
python -c "import cv2; print(cv2.__version__)"
it seems to look for ffmpeg under Homebrew (I didn't install any of these via Homebrew, only with conda). It complained:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/Users/adi/miniforge3/envs/silicon/lib/python3.8/site-packages/cv2/__init__.py", line 5, in <module>
from .cv2 import *
ImportError: dlopen(/Users/adi/miniforge3/envs/silicon/lib/python3.8/site-packages/cv2/cv2.cpython-38-darwin.so, 2): Library not loaded: /opt/homebrew/opt/ffmpeg/lib/libavcodec.58.dylib
Referenced from: /Users/adi/miniforge3/envs/silicon/lib/python3.8/site-packages/cv2/cv2.cpython-38-darwin.so
Reason: image not found
Though I have these libs, so when I searched with: find /usr/ -name 'libavcodec.58.dylib' I could find many locations:
find: /usr//sbin/authserver: Permission denied
find: /usr//local/mysql-8.0.22-macos10.15-x86_64/keyring: Permission denied
find: /usr//local/mysql-8.0.22-macos10.15-x86_64/data: Permission denied
find: /usr//local/hw_mp_userdata/Internet_Manager/OnlineUpdate: Permission denied
/usr//local/lib/libavcodec.58.dylib
/usr//local/Cellar/ffmpeg/4.4_2/lib/libavcodec.58.dylib
(silicon) MacBook-Pro:opencv-4.5.0 adi$ ln -s /usr/local/Cellar/ffmpeg/4.4_2/lib/libavcodec.58.dylib /opt/homebrew/opt/ffmpeg/lib/libavcodec.58.dylib
ln: /opt/homebrew/opt/ffmpeg/lib/libavcodec.58.dylib: No such file or directory
One of the guides said to install homebrew also in arm64 env, so I did it with:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
export PATH="/opt/homebrew/bin:/usr/local/bin:$PATH"
alias ibrew='arch -x86_64 /usr/local/bin/brew' # create brew for intel (ibrew) and arm/ silicon
I'm not sure whether that affects it, but it doesn't seem to have changed anything, because it still uses /opt/homebrew/ instead of /usr/local/.
Any help getting either approach to work would be highly appreciated. Ultimately I want to use the TensorFlow Model Zoo object detection models. All the other dependencies seem fine (for now); the problem is that either OpenCV doesn't work or, if I get it working with conda install, scipy and scikit-learn don't work.
In my case I also had a lot of trouble installing both modules. I finally managed it, but to be honest I'm not really sure how or why. Below are the requirements in case you want to recreate the environment that worked for me. You should have conda Miniforge 3 installed:
# This file may be used to create an environment using:
# $ conda create --name <env> --file <this file>
# platform: osx-arm64
absl-py=1.0.0=pypi_0
astunparse=1.6.3=pypi_0
autocfg=0.0.8=pypi_0
blas=2.113=openblas
blas-devel=3.9.0=13_osxarm64_openblas
boto3=1.22.10=pypi_0
botocore=1.25.10=pypi_0
c-ares=1.18.1=h1a28f6b_0
ca-certificates=2022.2.1=hca03da5_0
cachetools=5.0.0=pypi_0
certifi=2021.10.8=py39hca03da5_2
charset-normalizer=2.0.12=pypi_0
cycler=0.11.0=pypi_0
expat=2.4.4=hc377ac9_0
flatbuffers=2.0=pypi_0
fonttools=4.31.1=pypi_0
gast=0.5.3=pypi_0
gluoncv=0.10.5=pypi_0
google-auth=2.6.0=pypi_0
google-auth-oauthlib=0.4.6=pypi_0
google-pasta=0.2.0=pypi_0
grpcio=1.42.0=py39h95c9599_0
h5py=3.6.0=py39h7fe8675_0
hdf5=1.12.1=h5aa262f_1
idna=3.3=pypi_0
importlib-metadata=4.11.3=pypi_0
jmespath=1.0.0=pypi_0
keras=2.8.0=pypi_0
keras-preprocessing=1.1.2=pypi_0
kiwisolver=1.4.0=pypi_0
krb5=1.19.2=h3b8d789_0
libblas=3.9.0=13_osxarm64_openblas
libcblas=3.9.0=13_osxarm64_openblas
libclang=13.0.0=pypi_0
libcurl=7.80.0=hc6d1d07_0
libcxx=12.0.0=hf6beb65_1
libedit=3.1.20210910=h1a28f6b_0
libev=4.33=h1a28f6b_1
libffi=3.4.2=hc377ac9_2
libgfortran=5.0.0=11_1_0_h6a59814_26
libgfortran5=11.1.0=h6a59814_26
libiconv=1.16=h1a28f6b_1
liblapack=3.9.0=13_osxarm64_openblas
liblapacke=3.9.0=13_osxarm64_openblas
libnghttp2=1.46.0=h95c9599_0
libopenblas=0.3.18=openmp_h5dd58f0_0
libssh2=1.9.0=hf27765b_1
llvm-openmp=12.0.0=haf9daa7_1
markdown=3.3.6=pypi_0
matplotlib=3.5.1=pypi_0
mxnet=1.6.0=pypi_0
ncurses=6.3=h1a28f6b_2
numpy=1.21.2=py39hb38b75b_0
numpy-base=1.21.2=py39h6269429_0
oauthlib=3.2.0=pypi_0
openblas=0.3.18=openmp_h3b88efd_0
opencv-python=4.5.5.64=pypi_0
openssl=1.1.1m=h1a28f6b_0
opt-einsum=3.3.0=pypi_0
packaging=21.3=pypi_0
pandas=1.4.1=pypi_0
pillow=9.0.1=pypi_0
pip=22.0.4=pypi_0
portalocker=2.4.0=pypi_0
protobuf=3.19.4=pypi_0
pyasn1=0.4.8=pypi_0
pyasn1-modules=0.2.8=pypi_0
pydot=1.4.2=pypi_0
pyparsing=3.0.7=pypi_0
python=3.9.7=hc70090a_1
python-dateutil=2.8.2=pypi_0
python-graphviz=0.8.4=pypi_0
pytz=2022.1=pypi_0
pyyaml=6.0=pypi_0
readline=8.1.2=h1a28f6b_1
requests=2.27.1=pypi_0
requests-oauthlib=1.3.1=pypi_0
rsa=4.8=pypi_0
s3transfer=0.5.2=pypi_0
scipy=1.8.0=pypi_0
setuptools=58.0.4=py39hca03da5_1
six=1.16.0=pyhd3eb1b0_1
sqlite=3.38.0=h1058600_0
tensorboard=2.8.0=pypi_0
tensorboard-data-server=0.6.1=pypi_0
tensorboard-plugin-wit=1.8.1=pypi_0
tensorflow-deps=2.8.0=0
tensorflow-macos=2.8.0=pypi_0
termcolor=1.1.0=pypi_0
tf-estimator-nightly=2.8.0.dev2021122109=pypi_0
tk=8.6.11=hb8d0fd4_0
tqdm=4.63.1=pypi_0
typing-extensions=4.1.1=pypi_0
tzdata=2021e=hda174b7_0
urllib3=1.26.9=pypi_0
werkzeug=2.0.3=pypi_0
wheel=0.37.1=pyhd3eb1b0_0
wrapt=1.14.0=pypi_0
xz=5.2.5=h1a28f6b_0
yacs=0.1.8=pypi_0
zipp=3.7.0=pypi_0
zlib=1.2.11=h5a0b063_4
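Entries in the list above whose build string is pypi_0 were installed with pip, and conda may not be able to find them in its channels when the file is passed to conda create --file. One possible workaround, sketched here under that assumption, is to split the list into a conda part and a pip part. The function names and the file names in the usage comments are hypothetical, and some pip package names may need adjusting by hand.

```shell
# Print the lines conda can be expected to resolve:
# everything except comments and pip-installed (pypi_0) entries.
conda_specs() (
  grep -v -e '^#' -e '=pypi_0$' "$1"
)

# Print pip requirement lines (name==version) for the pypi_0 entries.
pip_specs() (
  awk -F= '$3 == "pypi_0" { print $1 "==" $2 }' "$1"
)

# Possible use (not run here), with env.txt holding the list above:
#   conda_specs env.txt > conda.txt
#   pip_specs env.txt > requirements.txt
#   conda create --name <env> --file conda.txt
#   conda activate <env> && pip install -r requirements.txt
```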

GPU error: Sagemaker ml.p2.xlarge instance using tensorflow==2.3.0

I got the following error when trying to train my tensorflow model on a SageMaker ml.p2.xlarge instance. I use tensorflow==2.3.0. I wonder whether this is caused by an incompatibility between the tensorflow version and CUDA; the SageMaker ml.p2.xlarge instance seems to use CUDA 10.0.
GPU error:
2020-08-31 08:46:46.429756: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/openmpi/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-08-31 08:47:02.170819: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/openmpi/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-08-31 08:47:02.764874: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1753] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
This question is old, but it touches on an issue that comes up as soon as you choose which framework versions to use.
The problem does not depend on the instance type you specified (which does have an NVIDIA GPU).
According to the official documentation "Available Deep Learning Containers Images", as of 20/10/2022 no precompiled images for versions above 2.2 appear to be available:
Framework | Job Type | Horovod Options | CPU/GPU | Python Version Options | Example URL
--- | --- | --- | --- | --- | ---
TensorFlow 2.2 (Cuda 10.2) | training | Yes | GPU | 3.7 (py37) | 763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training:2.2.0-gpu-py37-cu102-ubuntu18.04
TensorFlow 2.2 | inference | No | GPU | 3.7 (py37) | 763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference:2.2.0-gpu-py37-cu102-ubuntu18.04
The Dockerfile used to build the container includes the instruction that installs the libraries your custom version is missing:
RUN apt-get update && apt-get install -y --no-install-recommends --allow-unauthenticated \
python3-dev \
python3-pip \
python3-setuptools \
ca-certificates \
cuda-command-line-tools-10-1 \
cuda-cudart-dev-10-1 \
cuda-cufft-dev-10-1 \
cuda-curand-dev-10-1 \
cuda-cusolver-dev-10-1 \
cuda-cusparse-dev-10-1 \
curl \
libcudnn7=7.6.2.24-1+cuda10.1 \
# TensorFlow doesn't require libnccl anymore but Open MPI still depends on it
libnccl2=2.4.7-1+cuda10.1 \
libgomp1 \
libnccl-dev=2.4.7-1+cuda10.1 \
....
Then you can install the libraries your custom version requires, either with a requirements.txt file or by running the install command directly in the training script.
If there are no special project requirements, I recommend using the precompiled SageMaker versions. Otherwise, build a Docker image from scratch instead of installing libraries this way.
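To check whether the library named in the warning is present in any of the directories listed in LD_LIBRARY_PATH, a small helper along these lines may be useful. It is only a sketch: the function name find_lib_in_path is made up for this example, and it looks only at the directories it is given, not at the system's default library locations.

```shell
# Search a colon-separated directory list for a library file.
# Prints every match and returns 0 if at least one was found, 1 otherwise.
find_lib_in_path() (
  lib=$1
  found=1
  IFS=:
  set -f   # no globbing while the list is split on ':'
  for dir in $2; do
    if [ -n "$dir" ] && [ -e "$dir/$lib" ]; then
      printf '%s\n' "$dir/$lib"
      found=0
    fi
  done
  exit "$found"
)

# Example: look for the CUDA runtime that the warning says is missing.
find_lib_in_path libcudart.so.10.1 "${LD_LIBRARY_PATH:-}" \
  || echo "libcudart.so.10.1 not found in LD_LIBRARY_PATH"
```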

How to build tensorflow op with bazel with additional include directories

I got tensorflow binaries (already compiled)
I have added to tensorflow source:
tensorflow\core\user_ops\icp_op_kernel.cc - contains:
https://github.com/tensorflow/models/blob/master/research/vid2depth/ops/icp_op_kernel.cc
tensorflow\core\user_ops\BUILD - contains:
load("//tensorflow:tensorflow.bzl", "tf_custom_op_library")
tf_custom_op_library(
name = "icp_op_kernel.so",
srcs = ["icp_op_kernel.cc"],
)
I am trying to build with:
bazel build --config opt //tensorflow/core/user_ops:icp_op_kernel.so
And I get:
tensorflow/core/user_ops/icp_op_kernel.cc(16): fatal error C1083: Cannot open include file: 'pcl/point_types.h': No such file or directory
Because bazel doesn't know where the pcl include files are.
I have installed pcl and the include directory is in:
C:\Program Files\PCL 1.6.0\include\pcl-1.6
How do I tell bazel to also include this directory?
Also, I will probably need to add C:\Program Files\PCL 1.6.0\lib to the link step. How do I do that?
You don't need bazel for building ops if it fails.
I have implemented custom ops for both CPU and GPU, basically by following the two Tensorflow tutorials.
For CPU ops, follow Tensorflow tutorial on Build the op library:
TF_CFLAGS=( $(python -c 'import tensorflow as tf; print(" ".join(tf.sysconfig.get_compile_flags()))') )
TF_LFLAGS=( $(python -c 'import tensorflow as tf; print(" ".join(tf.sysconfig.get_link_flags()))') )
g++ -std=c++11 -shared zero_out.cc -o zero_out.so -fPIC ${TF_CFLAGS[@]} ${TF_LFLAGS[@]} -O2
Note on gcc version >=5: gcc uses the new C++ ABI since version 5. The binary pip packages available on the TensorFlow website are built with gcc4 that uses the older ABI. If you compile your op library with gcc>=5, add -D_GLIBCXX_USE_CXX11_ABI=0 to the command line to make the library compatible with the older abi.
For GPU ops, check the current official GPU ops building instructions on Tensorflow adding GPU op support:
nvcc -std=c++11 -c -o cuda_op_kernel.cu.o cuda_op_kernel.cu.cc \
${TF_CFLAGS[@]} -D GOOGLE_CUDA=1 -x cu -Xcompiler -fPIC
g++ -std=c++11 -shared -o cuda_op_kernel.so cuda_op_kernel.cc \
cuda_op_kernel.cu.o ${TF_CFLAGS[@]} -fPIC -lcudart ${TF_LFLAGS[@]}
As the tutorial notes, if your CUDA libraries are not installed in /usr/local/lib64, you'll need to specify the path explicitly in the second (g++) command above. For example, add -L /usr/local/cuda-8.0/lib64/ if your CUDA is installed in /usr/local/cuda-8.0.
Also note that in some Linux setups, additional options are needed for the nvcc compile step. Add -D_MWAITXINTRIN_H_INCLUDED to the nvcc command line to avoid errors from mwaitxintrin.h.
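For the extra include and library directories asked about in the question, the g++ route takes the usual -I and -L flags. The sketch below only prints a command line so the flags can be inspected before running anything. The function name op_build_cmd, the variables TF_CFLAGS_STR and TF_LFLAGS_STR, and the PCL paths in the example call are all hypothetical; the -l flags for the specific PCL libraries the op needs are not shown and would have to be added.

```shell
# Print (without running) a g++ command for a custom op that needs headers
# and libraries from a third-party package such as PCL.
# Usage: op_build_cmd SRC OUT INCLUDE_DIR LIB_DIR
# TF_CFLAGS_STR / TF_LFLAGS_STR are hypothetical plain-string variables that
# would hold the TensorFlow compile and link flags, if set.
op_build_cmd() (
  printf 'g++ -std=c++11 -shared %s -o %s -fPIC -I%s -L%s %s %s -O2\n' \
    "$1" "$2" "$3" "$4" "${TF_CFLAGS_STR:-}" "${TF_LFLAGS_STR:-}"
)

# Example call with made-up Linux paths for a PCL installation.
op_build_cmd icp_op_kernel.cc icp_op_kernel.so \
  /usr/include/pcl-1.8 /usr/lib/x86_64-linux-gnu
```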

How to edit the linker flags bazel uses to build syntaxnet/tensorflow

I can't get Tensorflow with Syntaxnet to build with CUDA on Ubuntu 16.04.
I have built it successfully without CUDA on this system.
Most likely the error is rooted in the configuration. The bazel build of tensorflow with CUDA generates linker commands for shared libraries with the linker option -pie for generating executables with position independent code. This causes the error "undefined reference to `main'".
/home/patrick/.cache/bazel/_bazel_patrick/5b9c9cf56f3e0138be05b0752b134bcb/external/com_google_absl/absl/base/BUILD.bazel:28:1: Linking of rule '@com_google_absl//absl/base:spinlock_wait' failed (Exit 1):
crosstool_wrapper_driver_is_not_gcc failed: error executing command
(cd /home/patrick/.cache/bazel/_bazel_patrick/5b9c9cf56f3e0138be05b0752b134bcb/execroot/__main__ && exec env - \
CUDA_TOOLKIT_PATH=/usr/local/cuda \
CUDNN_INSTALL_PATH=/usr/local/cuda \
GCC_HOST_COMPILER_PATH=/usr/bin/gcc \
LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64:/usr/local/cuda-9.0/extras/CUPTI/lib64:/usr/local/cuda-9.0/nvvm/lib64 \
NCCL_INSTALL_PATH=/usr \
PATH=/home/patrick/bin:/home/patrick/.local/bin:/usr/local/cuda/bin:/usr/bin:/bin \
PWD=/proc/self/cwd \
PYTHON_BIN_PATH=/usr/bin/python \
PYTHON_LIB_PATH=/usr/local/lib/python2.7/dist-packages \
TF_CUDA_CLANG=0 \
TF_CUDA_COMPUTE_CAPABILITIES=6.1 \
TF_CUDA_VERSION=9.0 \
TF_CUDNN_VERSION=7 \
TF_NCCL_VERSION=2 \
TF_NEED_CUDA=1 \
TF_NEED_OPENCL_SYCL=0 \
external/local_config_cuda/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -shared -o bazel-out/k8-opt/bin/external/com_google_absl/absl/base/libspinlock_wait.so -Wl,-no-as-needed -B/usr/bin/ -pie -Wl,-z,relro,-z,now -no-canonical-prefixes -pass-exit-codes '-Wl,--build-id=md5' '-Wl,--hash-style=gnu' -Wl,--gc-sections -Wl,@bazel-out/k8-opt/bin/external/com_google_absl/absl/base/libspinlock_wait.so-2.params)
/usr/lib/gcc/x86_64-linux-gnu/5/../../../x86_64-linux-gnu/Scrt1.o: In function `_start':
(.text+0x20): undefined reference to `main'
collect2: error: ld returned 1 exit status
This linking command succeeds when removing the option -pie.
I would appreciate help either finding a way to edit the linker flags Bazel uses, or a hint about the configuration error I made from anyone who has hit a similar problem. I don't think posting my configuration steps would lead to suggestions other than the ones I have already read in other posts. The build process looks too shaky to me.
I already had a look at the definition in the CROSSTOOL and BUILD files. I did not edit them and they look Ok (-pie is only enabled for linking executables).
I work with
Bazel 0.15.2
Tensorflow 1.8.0
Ubuntu 16.04
gcc 5.4
CUDA 9.0
CUDNN 7.1
NCCL 2.1

tensorflow tools: can't figure out how to build/run summarize_graph

I'm trying to convert my *.pb tensorflow model to coreML, but I'm stuck on identifying the output node of my model.
To find the output node, I've tried to build and run summarize_graph on my *.pb file, but I'm running into issues. How do I build and run summarize_graph after downloading the source?
I've run the following command:
bazel build tensorflow/tools/graph_transforms:summarize_graph
bazel-bin/tensorflow/tools/graph_transforms/summarize_graph --in_graph=tensorflow_inception_graph.pb
and I get the following error:
INFO: Analysed 0 targets (0 packages loaded).
INFO: Found 0 targets...
INFO: Elapsed time: 0.389s, Critical Path: 0.01s
INFO: Build completed successfully, 1 total action
bazel-bin/tensorflow/tools/graph_transforms/summarize_graph: No such file or directory
After issuing the bazel command, a blank bazel-bin directory appears in the location I executed the command.
Note, summarize_graph didn't exist in my tensorflow installation. So I downloaded the source from github tensorflow/tools/graph_transforms and copied it into my tensorflow/tools/graph_transforms directory.
the directory contains the following:
BUILD README.md
init.py
init.pyc add_default_attributes.cc add_default_attributes_test.cc backports.cc backports_test.cc compare_graphs.cc
fake_quantize_training.cc fake_quantize_training_test.cc file_utils.cc
file_utils.h file_utils_test.cc flatten_atrous.cc
flatten_atrous_test.cc fold_batch_norms.cc fold_batch_norms_test.cc
fold_constants_lib.cc fold_constants_lib.h fold_constants_test.cc
fold_old_batch_norms.cc fold_old_batch_norms_test.cc
freeze_requantization_ranges.cc freeze_requantization_ranges_test.cc
fuse_convolutions.cc fuse_convolutions_test.cc insert_logging.cc
insert_logging_test.cc obfuscate_names.cc obfuscate_names_test.cc out
python quantize_nodes.cc quantize_nodes_test.cc quantize_weights.cc
quantize_weights_test.cc remove_attribute.cc remove_attribute_test.cc
remove_device.cc remove_device_test.cc remove_ema.cc
remove_ema_test.cc remove_nodes.cc remove_nodes_test.cc
rename_attribute.cc rename_attribute_test.cc rename_op.cc
rename_op_test.cc round_weights.cc round_weights_test.cc set_device.cc
set_device_test.cc sort_by_execution_order.cc
sort_by_execution_order_test.cc sparsify_gather.cc
sparsify_gather_test.cc strip_unused_nodes.cc
strip_unused_nodes_test.cc summarize_graph_main.cc transform_graph.cc
transform_graph.h transform_graph_main.cc transform_graph_test.cc
transform_utils.cc transform_utils.h transform_utils_test.cc
I'm on a macbook pro
Thanks!
In case anyone runs into a similar problem, here is how I solved it.
Navigate to the root of the tensorflow source directory
cmd> ./configure
cmd> bazel build tensorflow/tools/graph_transforms:summarize_graph
(you may get an error about xcode; if so, run the following)
cmd> xcode-select -s /Applications/Xcode.app/Contents/Developer
cmd> bazel clean --expunge
cmd> bazel build tensorflow/tools/graph_transforms:summarize_graph
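The first step above matters because Bazel resolves targets relative to the workspace root. A quick way to check where that root is before building is sketched below. It assumes the source tree marks its root with a file named WORKSPACE, as the TensorFlow sources of that period do; the function name find_workspace_root is made up for this example.

```shell
# Walk upwards from a directory (default: the current one) until a
# WORKSPACE file is found; print that directory, or return 1 if none exists.
find_workspace_root() (
  dir=$(cd "${1:-.}" && pwd) || exit 1
  while :; do
    if [ -f "$dir/WORKSPACE" ]; then
      printf '%s\n' "$dir"
      exit 0
    fi
    [ "$dir" = "/" ] && exit 1
    dir=$(dirname "$dir")
  done
)

# Example: report the root, or warn that bazel would find no targets here.
find_workspace_root . || echo "no WORKSPACE file found above $(pwd)"
```

Running bazel build from the directory this prints, with the target written relative to it, matches the steps in the answer.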
CentOS 7 walkthrough:
yum install epel-release
yum update
yum install patch
curl https://copr.fedorainfracloud.org/coprs/vbatts/bazel/repo/epel-7/vbatts-bazel-epel-7.repo -o /etc/yum.repos.d/vbatts-bazel-epel-7.repo
yum install bazel
curl -L -O https://github.com/tensorflow/tensorflow/archive/v1.8.0.tar.gz
tar -xzf v1.8.0.tar.gz
cd tensorflow-1.8.0
./configure # interactive!
bazel build tensorflow/tools/graph_transforms:summarize_graph
bazel-bin/tensorflow/tools/graph_transforms/summarize_graph