How to deploy a PaddlePaddle Docker container with GPU support? - gpu

There're slight discrepancies in the documentation Docker deployment for PaddlePaddle as compared to the documentation to manually install PaddlePaddle from source.
The documentation from the Docker deployment states after pulling the container from Docker Hub:
docker pull paddledev/paddle
the environment variables should be set and included in the docker run, i.e.:
export CUDA_SO="$(\ls /usr/lib64/libcuda* | xargs -I{} echo '-v {}:{}') $(\ls /usr/lib64/libnvidia* | xargs -I{} echo '-v {}:{}')"
export DEVICES=$(\ls /dev/nvidia* | xargs -I{} echo '--device {}:{}')
docker run ${CUDA_SO} ${DEVICES} -it paddledev/paddle:gpu-latest
The export commands seem to be looking for libcuda* and libnvidia* in /usr/lib64/ but in the documentation from source compilation, the location of lib64/ should be in /usr/local/cuda/lib64.
Regardless, the location of lib64/ can be found with:
cat /etc/ld.so.conf.d/cuda.conf
Additionally, the export command is looking for libnvidia* which doesn't seem exist anywhere in /usr/local/cuda/, except for libnvidia-ml.so:
/usr/local/cuda$ find . -name 'libnvidia*'
./lib64/stubs/libnvidia-ml.so
I suppose the correct files the CUDA_SO is looking for are
/usr/local/cuda/lib64/libcudart.so.8.0
/usr/local/cuda/lib64/libcudart.so.7.5
But is that right? What is the environmental variable(s) for CUDA_SO to deploy PaddlePaddle with GPU support?
Even after setting the libcudart* variable, the docker container doesn't seem to find the GPU driver, i.e.:
user0#server1:~/dockdock$ echo CUDA_SO="$(\ls $CUDA_CONFILE/libcuda* | xargs -I{} echo '-v {}:{}')"
CUDA_SO=-v /usr/local/cuda/lib64/libcudadevrt.a:/usr/local/cuda/lib64/libcudadevrt.a
-v /usr/local/cuda/lib64/libcudart.so:/usr/local/cuda/lib64/libcudart.so
-v /usr/local/cuda/lib64/libcudart.so.8.0:/usr/local/cuda/lib64/libcudart.so.8.0
-v /usr/local/cuda/lib64/libcudart.so.8.0.44:/usr/local/cuda/lib64/libcudart.so.8.0.44
-v /usr/local/cuda/lib64/libcudart_static.a:/usr/local/cuda/lib64/libcudart_static.a
user0# server1:~/dockdock$ export CUDA_SO="$(\ls $CUDA_CONFILE/libcuda* | xargs -I{} echo '-v {}:{}')"
user0# server1:~/dockdock$ export DEVICES=$(\ls /dev/nvidia* | xargs -I{} echo '--device {}:{}')
user0# server1:~/dockdock$ docker run ${CUDA_SO} ${DEVICES} -it paddledev/paddle:gpu-latest
root#bd25dfd4f824:/# git clone https://github.com/baidu/Paddle paddle
Cloning into 'paddle'...
remote: Counting objects: 26626, done.
remote: Compressing objects: 100% (23/23), done.
remote: Total 26626 (delta 3), reused 0 (delta 0), pack-reused 26603
Receiving objects: 100% (26626/26626), 25.41 MiB | 4.02 MiB/s, done.
Resolving deltas: 100% (18786/18786), done.
Checking connectivity... done.
root#bd25dfd4f824:/# cd paddle/demo/quick_start/
root#bd25dfd4f824:/paddle/demo/quick_start# sed -i 's|--use_gpu=false|--use_gpu=true|g' train.sh
root#bd25dfd4f824:/paddle/demo/quick_start# bash train.sh
I0410 09:25:37.300365 48 Util.cpp:155] commandline: /usr/local/bin/../opt/paddle/bin/paddle_trainer --config=trainer_config.lr.py --save_dir=./output --trainer_count=4 --log_period=100 --num_passes=15 --use_gpu=true --show_parameter_stats_period=100 --test_all_data_in_one_period=1
F0410 09:25:37.300940 48 hl_cuda_device.cc:526] Check failed: cudaSuccess == cudaStat (0 vs. 35) Cuda Error: CUDA driver version is insufficient for CUDA runtime version
*** Check failure stack trace: ***
# 0x7efc20557daa (unknown)
# 0x7efc20557ce4 (unknown)
# 0x7efc205576e6 (unknown)
# 0x7efc2055a687 (unknown)
# 0x895560 hl_specify_devices_start()
# 0x89576d hl_start()
# 0x80f402 paddle::initMain()
# 0x52ac5b main
# 0x7efc1f763f45 (unknown)
# 0x540c05 (unknown)
# (nil) (unknown)
/usr/local/bin/paddle: line 109: 48 Aborted (core dumped) ${DEBUGGER} $MYDIR/../opt/paddle/bin/paddle_trainer ${#:2}
[1]: http://www.paddlepaddle.org/doc/build/docker_install.html
[2]: http://paddlepaddle.org/doc/build/build_from_source.html
How to deploy a PaddlePaddle Docker container with GPU support?
Also, in Chinese: https://github.com/PaddlePaddle/Paddle/issues/1764

Please refer to http://www.paddlepaddle.org/develop/doc/getstarted/build_and_install/docker_install_en.html
The recommended way is using nvidia-docker.
Please install nvidia-docker first following this tutorial.
Now you can run a GPU image:
docker pull paddlepaddle/paddle
nvidia-docker run -it --rm paddlepaddle/paddle:0.10.0rc2-gpu /bin/bash

Related

GitHub Packages: Inconsistency when running Docker Image Locally and in GitHub Actions

I have been using Github actions to build and publish a docker image to the Github Container Registry according to the Documentation. I am getting an inconsistency behavior when I pull the new image and test it locally.
I have a CMake project in C++ that runs a simple hello world with an INTERFACE and SHARED library.
When I build a docker image locally and test it, this is the output (which is working fine):
*************************************
*** DBSCAN Cluster Segmentation ***
*************************************
--cloudfile: required.
Usage: program [options]
Optional arguments:
-h --help shows help message and exits [default: false]
-v --version prints version information and exits [default: false]
--cloudfile input cloud file [required]
--octree-res octree resolution [default: 120]
--eps epsilon value [default: 40]
--minPtsAux minimum auxiliar points [default: 5]
--minPts minimum points [default: 5]
-o --output-dir output dir to save clusters [default: "-"]
--ext cluster output extension [pcd, ply, txt, xyz] [default: "pcd"]
-d --display display clusters in the pcl visualizer [default: false]
--cal-eps calculate the value of epsilon with the distance to the nearest n points [default: false]
In Github Actions I am using this workflow:
name: Demo Push
on:
push:
# Publish `master` as Docker `latest` image.
branches: ["test-github-packages"]
# Publish `v1.2.3` tags as releases.
tags:
- v*
# Run tests for any PRs.
pull_request:
env:
IMAGE_NAME: dbscan-octrees
jobs:
# Push image to GitHub Packages.
# See also https://docs.docker.com/docker-hub/builds/
push:
runs-on: ubuntu-latest
permissions:
packages: write
contents: read
steps:
- uses: actions/checkout#v3
with:
submodules: recursive
- name: Build image
run: docker build --file Dockerfile --tag $IMAGE_NAME --label "runnumber=${GITHUB_RUN_ID}" .
- name: Test image
run: |
docker run --rm \
--env="DISPLAY" \
--env="QT_X11_NO_MITSHM=1" \
--volume="/tmp/.X11-unix:/tmp/.X11-unix:rw" \
dbscan-octrees:latest
- name: Log in to registry
# This is where you will update the PAT to GITHUB_TOKEN
run: echo "${{ secrets.GITHUB_TOKEN }}" | docker login ghcr.io -u $ --password-stdin
- name: Push image
run: |
IMAGE_ID=ghcr.io/${{ github.repository_owner }}/$IMAGE_NAME
# Change all uppercase to lowercase
IMAGE_ID=$(echo $IMAGE_ID | tr '[A-Z]' '[a-z]')
# Strip git ref prefix from version
VERSION=$(echo "${{ github.ref }}" | sed -e 's,.*/\(.*\),\1,')
# Strip "v" prefix from tag name
[[ "${{ github.ref }}" == "refs/tags/"* ]] && VERSION=$(echo $VERSION | sed -e 's/^v//')
# Use Docker `latest` tag convention
[ "$VERSION" == "master" ] && VERSION=latest
echo IMAGE_ID=$IMAGE_ID
echo VERSION=$VERSION
docker tag $IMAGE_NAME $IMAGE_ID:latest
docker push $IMAGE_ID:latest
The compilation and test steps are working fine with no errors (check this run). The problem is with the newly generated image after the push to the Github Container registry since when I pulled it locally to test it, the program is crashing with an "Illegal Instruction (core dumped)" error. I have tried to debug to find the problem and there is not a compilation error, link error, or something like that. I found out that this might be related to the linking part of the SHARED library, but it is strange because if the image is working when is built in the Github Action runner, I don't understand why fails the pushed image.
I found this post where the error might be something related to Github that changes the container during the installation.
Hope someone can help me with this.
This is the output in the Test image step on the workflow:
workflow
This is the error after pulling the newly generated image and testing it locally: error
I have even compared the bad binary file (Github version in the docker image) with the good version (Compiled version locally) using ghex, and the binary file generated by GitHub after pushing a new image is a little bigger than the good one.
binary comparision
binary sizes
Issue
CPU AVX instruction set not supported by local PC
Solution
Enable compilation flags in CMake to disable AVX support
Description
After digging using analysis tools for binaries files, debugging, etc. I discovered that the problem was related to the AVX CPU support in the GitHub action runner. My Computer does not support AVX optimized instructions, so I have to enable a compilation flag for my shared libraries in order to disable AVX support. This compilation flag will tell the Github Action runner to compile the project with no AVX CPU support or CPU optimizations which is the standard environment in GitHub Actions.
Analysis tools:
ldd binary
strace binary <-- this one allows me to identify the SIGEV_SIGNAL error code
container-diff
log error
Using the strace tool I got the next error:
--- SIGILL {si_signo=SIGILL, si_code=ILL_ILLOPN, si_addr=0x55dcd7324bc0} ---
+++ killed by SIGILL (core dumped) +++
Illegal instruction (core dumped)
This error allowed me to find the error code and after searching on the internet I found a solution to my specific problem since my project was using Point cloud Library (PCL), I compiled my project with -mno-avx, according to this post.
Solution
In the CMakeList.txt file for each SHARED library define the next compilation flag:
target_compile_options(${PROJECT_NAME} PUBLIC -mno-avx)
New issue
I have resolved the major issue, but now one of my shared libraries has the same error. I will try to fix it with one of these (I think) flags.
After making a lot of tests and using CPU-X software and detecting the proper architecture-specific options in my PC with the following command via GCC:
gcc -march=native -E -v - </dev/null 2>&1 | grep cc1
output:
/usr/lib/gcc/x86_64-linux-gnu/9/cc1 -E -quiet -v -imultiarch
x86_64-linux-gnu - -march=haswell -mmmx -mno-3dnow -msse -msse2
-msse3 -mssse3 -mno-sse4a -mcx16 -msahf -mmovbe -mno-aes -mno-sha
-mpclmul -mpopcnt -mabm -mno-lwp -mno-fma -mno-fma4 -mno-xop
-mno-bmi -mno-sgx -mno-bmi2 -mno-pconfig -mno-wbnoinvd -mno-tbm
-mno-avx -mno-avx2 -msse4.2 -msse4.1 -mlzcnt -mno-rtm -mno-hle
-mrdrnd -mno-f16c -mfsgsbase -mno-rdseed -mno-prfchw -mno-adx
-mfxsr -mno-xsave -mno-xsaveopt -mno-avx512f -mno-avx512er
-mno-avx512cd -mno-avx512pf -mno-prefetchwt1 -mno-clflushopt
-mno-xsavec -mno-xsaves -mno-avx512dq -mno-avx512bw -mno-avx512vl
-mno-avx512ifma -mno-avx512vbmi -mno-avx5124fmaps
-mno-avx5124vnniw -mno-clwb -mno-mwaitx -mno-clzero -mno-pku
-mno-rdpid -mno-gfni -mno-shstk -mno-avx512vbmi2 -mno-avx512vnni
-mno-vaes -mno-vpclmulqdq -mno-avx512bitalg -mno-avx512vpopcntdq
-mno-movdiri -mno-movdir64b -mno-waitpkg -mno-cldemote
-mno-ptwrite --param l1-cache-size=32
--param l1-cache-line-size=64 --param l2-cache-size=3072
-mtune=haswell -fasynchronous-unwind-tables
-fstack-protector-strong -Wformat -Wformat-security
-fstack-clash-protection -fcf-protection
Final solution
I have fixed the execution error with the following flags in my SHARED library:
# MMX, SSE(1, 2, 3, 3S, 4.1, 4.2), CLMUL, RdRand, VT-x, x86-64
target_compile_options(${PROJECT_NAME} PRIVATE -Wno-cpp
-mmmx
-msse
-msse2
-msse3
-mssse3
-msse4.2
-msse4.1
-mno-sse4a
-mno-avx
-mno-avx2
-mno-fma
-mno-fma4
-mno-f16c
-mno-xop
-mno-bmi
-mno-bmi2
-mrdrnd
-mno-3dnow
-mlzcnt
-mfsgsbase
-mpclmul
)
Now, the docker image stored in the GitHub Container Registry is working as expected on my local PC.
Related posts
What is the proper architecture-specific options (-m) for Sandy Bridge based Pentium?
using cmake to make a library without sse support (windows version)
https://github.com/PointCloudLibrary/pcl/issues/5248
Compile errors with Assembler messages
https://github.com/PointCloudLibrary/pcl/issues/1837

GraphDB Docker Container Fails to Run: adoptopenjdk/openjdk12:alpine

When using the standard DockerFile available here, GraphDB fails to start with the following output:
Could not find any executable java binary. Please install java in your PATH or set JAVA_HOME
Looking into it, the DockerFile uses adoptopenjdk/openjdk11:alpine which was recently updated to Alpine 3.14.
If I switch to an older Docker image (or use adoptopenjdk/openjdk12:alpine) then GraphDB starts without a problem.
How can I fix this while still using the latest version of adoptopenjdk/openjdk11:alpine?
Below is the DockerFile:
FROM adoptopenjdk/openjdk11:alpine
# Build time arguments
ARG version=9.1.1
ARG edition=ee
ENV GRAPHDB_PARENT_DIR=/opt/graphdb
ENV GRAPHDB_HOME=${GRAPHDB_PARENT_DIR}/home
ENV GRAPHDB_INSTALL_DIR=${GRAPHDB_PARENT_DIR}/dist
WORKDIR /tmp
RUN apk add --no-cache bash curl util-linux procps net-tools busybox-extras wget less && \
curl -fsSL "http://maven.ontotext.com/content/groups/all-onto/com/ontotext/graphdb/graphdb-${edition}/${version}/graphdb-${edition}-${version}-dist.zip" > \
graphdb-${edition}-${version}.zip && \
bash -c 'md5sum -c - <<<"$(curl -fsSL http://maven.ontotext.com/content/groups/all-onto/com/ontotext/graphdb/graphdb-${edition}/${version}/graphdb-${edition}-${version}-dist.zip.md5) graphdb-${edition}-${version}.zip"' && \
mkdir -p ${GRAPHDB_PARENT_DIR} && \
cd ${GRAPHDB_PARENT_DIR} && \
unzip /tmp/graphdb-${edition}-${version}.zip && \
rm /tmp/graphdb-${edition}-${version}.zip && \
mv graphdb-${edition}-${version} dist && \
mkdir -p ${GRAPHDB_HOME}
ENV PATH=${GRAPHDB_INSTALL_DIR}/bin:$PATH
CMD ["-Dgraphdb.home=/opt/graphdb/home"]
ENTRYPOINT ["/opt/graphdb/dist/bin/graphdb"]
EXPOSE 7200
The issue comes from an update in the base image. From a few weeks adopt switched to alpine 3.14 which has some issues with older container runtime (runc). The issue can be seen in the release notes: https://wiki.alpinelinux.org/wiki/Release_Notes_for_Alpine_3.14.0
Updating your Docker will fix the issue. However, if you don't wish to update your Docker, there's a workaround.
Some additional info:
The cause of the issue is that for some reason containers running in older docker versions and alpine 3.14 seem to have issues with the test flag "-x" so an if [ -x /opt/java/openjdk/bin/java ] returns false, although java is there and is executable.
You can workaround this for now by
Pull the GraphDB distribution
Unzip it
Open "setvars.in.sh" in the bin folder
Find and remove the if block around line 32
if [ ! -x "$JAVA" ]; then
echo "Could not find any executable java binary. Please install java in your PATH or set JAVA_HOME"
exit 1
fi
Zip it again and provide it in the Dockerfile without pulling it from maven.ontotext.com
Passing it to the Dockerfile is done with 'ADD'
You can check the GraphDB free version's Dockerfile for a reference on how to pass the zip file to the Dockerfile https://github.com/Ontotext-AD/graphdb-docker/blob/master/free-edition/Dockerfile

hdfsBuilderConnect error while using tfserving load model from hdfs

there is my environment info:
TensorFlow Serving version 1.14
os mac10.15.7
i want to load modle from hdfs by using tfserving.
when i build a tensorflow-serving:hadoop docker image,like this:
FROM tensorflow/serving:2.2.0
RUN apt update && apt install -y openjdk-8-jre
RUN mkdir /opt/hadoop-2.8.2
COPY /hadoop-2.8.2 /opt/hadoop-2.8.2
ENV JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64
ENV HADOOP_HDFS_HOME /opt/hadoop-2.8.2
ENV HADOOP_HOME /opt/hadoop-2.8.2
ENV LD_LIBRARY_PATH
${LD_LIBRARY_PATH}:${JAVA_HOME}/jre/lib/amd64/server
# ENV PATH $PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
RUN echo '#!/bin/bash \n\n\
CLASSPATH=(${HADOOP_HDFS_HOME}/bin/hadoop classpath --glob)
tensorflow_model_server --port=8500 --rest_api_port=9000 \
--model_name=${MODEL_NAME} --
model_base_path=${MODEL_BASE_PATH}/${MODEL_NAME} \
"#"' > /usr/bin/tf_serving_entrypoint.sh \
&& chmod +x /usr/bin/tf_serving_entrypoint.sh
EXPOSE 8500
EXPOSE 9000
ENTRYPOINT ["/usr/bin/tf_serving_entrypoint.sh"]
and then run :
docker run -p 9001:9000 --name tensorflow-serving-11 -e MODEL_NAME=tfrest -e MODEL_BASE_PATH=hdfs://ip:port/user/cess2_test/workspace/cess/models -t tensorflow_serving:1.14-hadoop-2.8.2
i met this problem. ps:i have already modify hadoop config in hadoop-2.8.2
hdfsBuilderConnect(forceNewInstance=0, nn=ip:port, port=0, kerbTicketCachePath=(NULL), userName=(NULL))
error:(unable to get stack trace for java.lang.NoClassDefFoundError exception: ExceptionUtils::getStackTrace error.)
is there any suggestions how to solve this problem?
thanks
i solved this problem by adding hadoop absolute path to classpath.

Tensorflow Serving: no versions of servable half_plus_two found under base path /models

I'm doing Tensorflow Serving with Docker (see here for docs). Server runs on our infra here. I've succeeded at requesting my model when the command to run the container is something like:
tensorflow_model_server --port=8500 --rest_api_port=8501 \
--model_name=${MODEL_NAME} --model_base_path=${MODEL_BASE_PATH}/${MODEL_NAME}
A curl request to the server returns the expected answer. Problem occurs when I try to use the model_config_file parameter. Command:
tensorflow_model_server --port=8500 --rest_api_port=8501 \
--model_config_file=/serving/models.conf
Config file is:
model_config_list: {
config: {
name: "half_plus_two",
base_path: "/models/",
model_platform: "tensorflow"
}
}
When I run the container with this command, I get the error:
No versions of servable half_plus_two found under base path /models/
(I've also tried to remove the trailing backslash on the base_path with no more success). I've seen this post on SO that reminds us to use a version under model dir and I have one. My /models dir is:
models
|
- half_plus_two
|
- 1
|
- saved_model.pb
- variables
- assets
Someone can help?
Had the same thing on Windows 10. I finally:
Noticed that I forgot to clone the tensorflow/serving repository to my local machine
Ran on Ubuntu-wsl-2 console, as using Windows command line, I could convince docker to correctly map the container's /models/half_plus_two to my local path (the -v option in the following command):
docker run -t --rm -p 8501:8501
-v "$TESTDATA/saved_model_half_plus_two_cpu:/models/half_plus_two"
-e MODEL_NAME=half_plus_two
tensorflow/serving &

Building Apache Impala fails

I was trying to build Apache Impala from source(newest version on github).
I followed following instructions to build Impala:
(1) clone Impala
> git clone https://git-wip-us.apache.org/repos/asf/incubator-impala.git
> cd Impala
(2) configure environmental variables
> export JAVA_HOME=/usr/lib/jvm/java-7-oracle-amd64
> export IMPALA_HOME=<path to Impala>
> export BOOST_LIBRARYDIR=/usr/lib/x86_64-linux-gnu
> export LC_ALL="en_US.UTF-8"
(3)build
${IMPALA_HOME}/buildall.sh -noclean -skiptests -build_shared_libs -format
(4) errors are shown below:
Heap is needed to find the cause. Looks like the compiler does not support the GLIBCXX_3.4.21. But the GCC is automatically downloaded by the building script.
Appreciate your help!!!
Starting from this commit https://github.com/apache/impala/commit/d5cefe07c931a0d3bf02bca97bbba05400d91a48 , Impala has been shipped with a development bootstrap script.
I tried the master branch in a fresh ubuntu 16.04 docker image and it works fine. Here is what I did.
checkout the latest impala code base and do
docker run --rm -it --privileged -v /home/amos/git/impala/:/root/Impala ubuntu:16.04
inside docker, do
apt-get update
apt-get install sudo
cd /root/Impala
comment this out in bin/bootstrap_system.sh if you don't need test data
# if ! [[ -d ~/Impala-lzo ]]
# then
# git clone https://github.com/cloudera/impala-lzo.git ~/Impala-lzo
# fi
# if ! [[ -d ~/hadoop-lzo ]]
# then
# git clone https://github.com/cloudera/hadoop-lzo.git ~/hadoop-lzo
# fi
# cd ~/hadoop-lzo/
# time -p ant package
also add this line before ssh localhost whoami
echo "source ${IMPALA_HOME}/bin/impala-config-local.sh" >> ~/.bashrc
change the build command to whatever you like in bin/bootstrap_development.sh
${IMPALA_HOME}/buildall.sh -noclean -skiptests -build_shared_libs -format
then run bin/bootstrap_development.sh
You'll be prompted for some input. Just fill in default value and it'll work.