Tensorflow Serving Compiling Failure For CPU AVX AVX2 - tensorflow-serving

I use the method in the tfx official document to compile the tfx devel in docker file. The OS is MacOS, intel CPU.
here is the docker build code for it
#!/bin/bash
USER=$1
TAG=$2
TF_SERVING_VERSION_GIT_BRANCH="2.4.1"
git clone --branch="${TF_SERVING_VERSION_GIT_BRANCH}" https://github.com/tensorflow/serving
TF_SERVING_BUILD_OPTIONS="--copt=-mavx --local_ram_resources=4096"
cd serving && \
docker build --pull -t $USER/tensorflow-serving-devel:$TAG \
--build-arg TF_SERVING_VERSION_GIT_BRANCH="${TF_SERVING_VERSION_GIT_BRANCH}" \
--build-arg TF_SERVING_BUILD_OPTIONS="${TF_SERVING_BUILD_OPTIONS}" \
-f tensorflow_serving/tools/docker/Dockerfile.devel .
Then I run the shell script with >3hrs and get the following failure:
Actually I cannot know the detail because the log file from docker is clipped by the builder.
Does anyone met the similar problem and can help on this topic?
Thanks a lot in advance!

These instruction sets are not available on all machines, especially with older processors.
If you'd like to apply generally recommended optimizations, including utilizing platform-specific instruction sets for your processor, you can add --config=nativeopt to Bazel build commands when building TensorFlow Serving.
tools/run_in_docker.sh bazel build --config=nativeopt tensorflow_serving/...

Related

Tensorflow serving failing with std::bad_alloc

I'm trying to run tensorflow-serving using docker compose (served model + microservice) but the tensorflow serving container fails with the error below and then restarts.
microservice | To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
tensorflow-serving | terminate called after throwing an instance of 'std::bad_alloc'
tensorflow-serving | what(): std::bad_alloc
tensorflow-serving | /usr/bin/tf_serving_entrypoint.sh: line 3: 7 Aborted
(core dumped) tensorflow_model_server --port=8500 --rest_api_port=8501 --model_name=${MODEL_NAME} --model_base_path=${MODEL_BASE_PATH}/${MODEL_NAME} "$#"
I monitored the memory usage and it seems like there's plenty of memory. I also increased the resource limit using Docker Desktop but still get the same error. Each request to the model is fairly small as the microservice is sending tokenized text with batch size of one. Any ideas?
I was encountering the same problem, and this fixed worked for me:
uninstalled and reinstalled:
tensorflow, tensorflow-gpu, etc to 2.9.0, (and trained and built my model)
docker pull and docker run tensorflow/serving:2.8.0 (this did the trick and finally got rid of this problem.)
Had the same error when using tensorflow/serving:latest. Based on Hanafi's response, I used tensorflow/serving:2.8.0 and it worked.
For reference, I used
sudo docker run -p 8501:8501 --mount type=bind,source= \
[PATH_TO_MODEL_DIRECTORY],target=/models/[MODEL_NAME] \
-e MODEL_NAME=[MODEL_NAME] -t tensorflow/serving:2.8.0
The issue is solved for TensorFlow and TensorFlow Serving 2.11 (not yet released) and fix is included in nightly release of TF serving. You can build nightly docker image or use pre-compiled version.
Also TensorFlow 2.9 and 2.10 was patched to fix this issue. Refer PR here.[1, 2]

GitHub Packages: Inconsistency when running Docker Image Locally and in GitHub Actions

I have been using Github actions to build and publish a docker image to the Github Container Registry according to the Documentation. I am getting an inconsistency behavior when I pull the new image and test it locally.
I have a CMake project in C++ that runs a simple hello world with an INTERFACE and SHARED library.
When I build a docker image locally and test it, this is the output (which is working fine):
*************************************
*** DBSCAN Cluster Segmentation ***
*************************************
--cloudfile: required.
Usage: program [options]
Optional arguments:
-h --help shows help message and exits [default: false]
-v --version prints version information and exits [default: false]
--cloudfile input cloud file [required]
--octree-res octree resolution [default: 120]
--eps epsilon value [default: 40]
--minPtsAux minimum auxiliar points [default: 5]
--minPts minimum points [default: 5]
-o --output-dir output dir to save clusters [default: "-"]
--ext cluster output extension [pcd, ply, txt, xyz] [default: "pcd"]
-d --display display clusters in the pcl visualizer [default: false]
--cal-eps calculate the value of epsilon with the distance to the nearest n points [default: false]
In Github Actions I am using this workflow:
name: Demo Push
on:
push:
# Publish `master` as Docker `latest` image.
branches: ["test-github-packages"]
# Publish `v1.2.3` tags as releases.
tags:
- v*
# Run tests for any PRs.
pull_request:
env:
IMAGE_NAME: dbscan-octrees
jobs:
# Push image to GitHub Packages.
# See also https://docs.docker.com/docker-hub/builds/
push:
runs-on: ubuntu-latest
permissions:
packages: write
contents: read
steps:
- uses: actions/checkout#v3
with:
submodules: recursive
- name: Build image
run: docker build --file Dockerfile --tag $IMAGE_NAME --label "runnumber=${GITHUB_RUN_ID}" .
- name: Test image
run: |
docker run --rm \
--env="DISPLAY" \
--env="QT_X11_NO_MITSHM=1" \
--volume="/tmp/.X11-unix:/tmp/.X11-unix:rw" \
dbscan-octrees:latest
- name: Log in to registry
# This is where you will update the PAT to GITHUB_TOKEN
run: echo "${{ secrets.GITHUB_TOKEN }}" | docker login ghcr.io -u $ --password-stdin
- name: Push image
run: |
IMAGE_ID=ghcr.io/${{ github.repository_owner }}/$IMAGE_NAME
# Change all uppercase to lowercase
IMAGE_ID=$(echo $IMAGE_ID | tr '[A-Z]' '[a-z]')
# Strip git ref prefix from version
VERSION=$(echo "${{ github.ref }}" | sed -e 's,.*/\(.*\),\1,')
# Strip "v" prefix from tag name
[[ "${{ github.ref }}" == "refs/tags/"* ]] && VERSION=$(echo $VERSION | sed -e 's/^v//')
# Use Docker `latest` tag convention
[ "$VERSION" == "master" ] && VERSION=latest
echo IMAGE_ID=$IMAGE_ID
echo VERSION=$VERSION
docker tag $IMAGE_NAME $IMAGE_ID:latest
docker push $IMAGE_ID:latest
The compilation and test steps are working fine with no errors (check this run). The problem is with the newly generated image after the push to the Github Container registry since when I pulled it locally to test it, the program is crashing with an "Illegal Instruction (core dumped)" error. I have tried to debug to find the problem and there is not a compilation error, link error, or something like that. I found out that this might be related to the linking part of the SHARED library, but it is strange because if the image is working when is built in the Github Action runner, I don't understand why fails the pushed image.
I found this post where the error might be something related to Github that changes the container during the installation.
Hope someone can help me with this.
This is the output in the Test image step on the workflow:
workflow
This is the error after pulling the newly generated image and testing it locally: error
I have even compared the bad binary file (Github version in the docker image) with the good version (Compiled version locally) using ghex, and the binary file generated by GitHub after pushing a new image is a little bigger than the good one.
binary comparision
binary sizes
Issue
CPU AVX instruction set not supported by local PC
Solution
Enable compilation flags in CMake to disable AVX support
Description
After digging using analysis tools for binaries files, debugging, etc. I discovered that the problem was related to the AVX CPU support in the GitHub action runner. My Computer does not support AVX optimized instructions, so I have to enable a compilation flag for my shared libraries in order to disable AVX support. This compilation flag will tell the Github Action runner to compile the project with no AVX CPU support or CPU optimizations which is the standard environment in GitHub Actions.
Analysis tools:
ldd binary
strace binary <-- this one allows me to identify the SIGEV_SIGNAL error code
container-diff
log error
Using the strace tool I got the next error:
--- SIGILL {si_signo=SIGILL, si_code=ILL_ILLOPN, si_addr=0x55dcd7324bc0} ---
+++ killed by SIGILL (core dumped) +++
Illegal instruction (core dumped)
This error allowed me to find the error code and after searching on the internet I found a solution to my specific problem since my project was using Point cloud Library (PCL), I compiled my project with -mno-avx, according to this post.
Solution
In the CMakeList.txt file for each SHARED library define the next compilation flag:
target_compile_options(${PROJECT_NAME} PUBLIC -mno-avx)
New issue
I have resolved the major issue, but now one of my shared libraries has the same error. I will try to fix it with one of these (I think) flags.
After making a lot of tests and using CPU-X software and detecting the proper architecture-specific options in my PC with the following command via GCC:
gcc -march=native -E -v - </dev/null 2>&1 | grep cc1
output:
/usr/lib/gcc/x86_64-linux-gnu/9/cc1 -E -quiet -v -imultiarch
x86_64-linux-gnu - -march=haswell -mmmx -mno-3dnow -msse -msse2
-msse3 -mssse3 -mno-sse4a -mcx16 -msahf -mmovbe -mno-aes -mno-sha
-mpclmul -mpopcnt -mabm -mno-lwp -mno-fma -mno-fma4 -mno-xop
-mno-bmi -mno-sgx -mno-bmi2 -mno-pconfig -mno-wbnoinvd -mno-tbm
-mno-avx -mno-avx2 -msse4.2 -msse4.1 -mlzcnt -mno-rtm -mno-hle
-mrdrnd -mno-f16c -mfsgsbase -mno-rdseed -mno-prfchw -mno-adx
-mfxsr -mno-xsave -mno-xsaveopt -mno-avx512f -mno-avx512er
-mno-avx512cd -mno-avx512pf -mno-prefetchwt1 -mno-clflushopt
-mno-xsavec -mno-xsaves -mno-avx512dq -mno-avx512bw -mno-avx512vl
-mno-avx512ifma -mno-avx512vbmi -mno-avx5124fmaps
-mno-avx5124vnniw -mno-clwb -mno-mwaitx -mno-clzero -mno-pku
-mno-rdpid -mno-gfni -mno-shstk -mno-avx512vbmi2 -mno-avx512vnni
-mno-vaes -mno-vpclmulqdq -mno-avx512bitalg -mno-avx512vpopcntdq
-mno-movdiri -mno-movdir64b -mno-waitpkg -mno-cldemote
-mno-ptwrite --param l1-cache-size=32
--param l1-cache-line-size=64 --param l2-cache-size=3072
-mtune=haswell -fasynchronous-unwind-tables
-fstack-protector-strong -Wformat -Wformat-security
-fstack-clash-protection -fcf-protection
Final solution
I have fixed the execution error with the following flags in my SHARED library:
# MMX, SSE(1, 2, 3, 3S, 4.1, 4.2), CLMUL, RdRand, VT-x, x86-64
target_compile_options(${PROJECT_NAME} PRIVATE -Wno-cpp
-mmmx
-msse
-msse2
-msse3
-mssse3
-msse4.2
-msse4.1
-mno-sse4a
-mno-avx
-mno-avx2
-mno-fma
-mno-fma4
-mno-f16c
-mno-xop
-mno-bmi
-mno-bmi2
-mrdrnd
-mno-3dnow
-mlzcnt
-mfsgsbase
-mpclmul
)
Now, the docker image stored in the GitHub Container Registry is working as expected on my local PC.
Related posts
What is the proper architecture-specific options (-m) for Sandy Bridge based Pentium?
using cmake to make a library without sse support (windows version)
https://github.com/PointCloudLibrary/pcl/issues/5248
Compile errors with Assembler messages
https://github.com/PointCloudLibrary/pcl/issues/1837

Freeze TensorFlow graph to use in iOS app

I have the below files:
1. retrained_graph.pb
2. retrained_labels.txt
3. _retrain_checkpoint.meta
4. _retrain_checkpoint.index
5. _retrain_checkpoint.data-00000-of-00001
6. checkpoint
Command Executed:
python freeze_graph.py
--input_graph=/Users/saurav/Desktop/example/tmp/retrained_graph.pb
--input_checkpoint=./_retrain_checkpoint
--output_graph=/Users/saurav/Desktop/example/tmp/frozen_graph.pb --output_node_names=softmax
Getting error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 44: invalid start byte
Here are screenshots:
Finally I found the answer. To freeze a graph you need to build with "bazel".
1. Install bazel by using homebrew. brew install bazel
2. If you don't have homebrew get it installed.
/usr/bin/ruby -e "$(curl -fsSL \
https://raw.githubusercontent.com/Homebrew/install/master/install)"
Clone tensorflow by command git clone https://github.com/tensorflow/tensorflow
Change directory to tensorflow in terminal
run command ./Configure. It asks few questions answer according your need. Most of them you can type "NO". It asks default path to Python you need to specify the path or just hit "enter".
Now build bazel for freeze_graph using command:
bazel build tensorflow/python/tools:freeze_graph
Keep the retrained graph and checkpoints in folder.
Run bazel command to freeze the graph.
bazel-bin/tensorflow/python/tools/freeze_graph \ --input_graph=YouDirectory/retrained_graph.pb \ --input_checkpoint=YouDirectory/_retrain_checkpoint \ --output_graph=YouDirectory/frozen_graph.pb

Installing tensorboard built from source

This is about a tensorboard which is built from source, not about pip-installed one.
I could successfully build it.
$ git clone https://github.com/tensorflow/tensorboard.git
$ cd tensorboard/
$ bazel build //tensorboard
tensorflow/tensorboard$ bazel build //tensorboard
Starting local Bazel server and connecting to it...
......................................
: (log messages here)
Target //tensorboard:tensorboard up-to-date:
bazel-bin/tensorboard/tensorboard
INFO: Elapsed time: 326.553s, Critical Path: 187.92s
INFO: 619 processes: 456 linux-sandbox, 12 local, 151 worker.
INFO: Build completed successfully, 1268 total actions
Then yes I can run it as documented in tensorboard/README.md, and it works.
$ ./bazel-bin/tensorboard/tensorboard --logdir path/to/logs
The problem is, I'd like to run it as if installed via pip like this:
$ tensorboard --logdir path/to/logs
But as far as I looked for, no script provided to create .whl file so that we can locally-pip-install it, unlike tensorflow provides one like this.
$ bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
$ sudo pip install /tmp/tensorflow_pkg/tensorflow-1.8.0-py2-none-any.whl
So... can anybody show how to do that? Making packaging script would solve this, but it should exist somewhere as long as tensorboard is provided via pip anyway. :)
My workaround so far is not clean enough:
$ ln -s /my/build/folder/tensorboard/bazel-bin/tensorboard/tensorboard ~/bin
$ ln -s /my/build/folder/tensorboard/bazel-bin/tensorboard/tensorboard.runfiles ~/bin
I appreciate your suggestions, thanks!
Update July-21:
Thanks to W JC, I found instruction is already there in tensorboard/pip_package/BUILD.
# rm -rf /tmp/tensorboard
# bazel run //tensorboard/pip_package:build_pip_package
# pip install -U /tmp/tensorboard/*py2*.pip
Though unfortunately it shows error in my environment, and I guess it's local issue maybe because I'm using anaconda.
But basically the problem was resolved. It should basically work as long as running on supported environment.
It seems there exists an script in the /tensorboard/pip_packages try to build wheels
bazel run //tensorboard/pip_package:build_pip_package ./ did generate the wheel out, but in the folder where bazel-bin points to. In my case, it's generated at ~/.cache/bazel/_bazel_peijia/b64ba42719633ff75eec6880decefcd3/execroot/org_tensorflow_tensorboard/bazel-out/k8-fastbuild/bin/tensorboard/pip_package/build_pip_package.runfiles/org_tensorflow_tensorboard/tensorboard-2.10.0a0-py3-none-any.whl

Cannot run Mesos Containers with GPU tasks

I am running Mesos on Ubuntu and am trying to execute:
mesos-execute \
--master=$(cat /etc/mesos/zk) \
--name=gpu-test \
--docker_image=nvidia/cuda \
--command="nvidia-smi" \
--framework_capabilities="GPU_RESOURCES" \
--resources="gpus:1"
and it is failing because: sh: 1: nvidia-smi: not found
even though when I run it without container support
mesos-execute \
--master=$(cat /etc/mesos/zk) \
--name=gpu-test \
--command="nvidia-smi" \
--framework_capabilities="GPU_RESOURCES" \
--resources="gpus:1"
it has access to the gpu
plus if I run it without container support but put the command as
nvidia-docker run -it nvidia/cuda nvidia-smi
it works, so it seems that the mesos containerizer doesnt have access to the GPUs. But in the /etc/mesos-slave/ directory I gave it containerizers mesos (and all the other required flags to run gpu commands). Plus non-gpu related commands are working fine.
This looks like a regression in 1.3.0. I downgraded to 1.2.1 on Ubuntu and can successfully use GPUs with Docker containers and the Mesos containerizer again.
sudo apt-get install mesos=1.2.1-2.0.1
It looks like someone filed a related bug but there's been no activity:
https://issues.apache.org/jira/browse/MESOS-7730