Stuck at training model with CPU - tensorflow-serving

As the example points out:
docker run -it -p 8500:8500 --gpus all tensorflow/serving:latest-devel
should train the MNIST model. However, I want to use an Intel CPU for training, not a GPU. But no luck: it gets stuck at "Training model..."
Here is the command I used:
docker run -it -p 8500:8500 tensorflow/serving:latest-devel

I found out that it downloads resources first, for which a proxy is sometimes needed.
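If the hang is just that initial download being blocked, one thing to try (a sketch; the proxy URL is a placeholder for your own) is passing proxy settings into the container:

# Forward proxy settings so the devel image can fetch the training data
docker run -it -p 8500:8500 \
  -e http_proxy=http://proxy.example.com:3128 \
  -e https_proxy=http://proxy.example.com:3128 \
  tensorflow/serving:latest-devel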

Tensorflow serving failing with std::bad_alloc

I'm trying to run tensorflow-serving using docker compose (served model + microservice) but the tensorflow serving container fails with the error below and then restarts.
microservice | To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
tensorflow-serving | terminate called after throwing an instance of 'std::bad_alloc'
tensorflow-serving | what(): std::bad_alloc
tensorflow-serving | /usr/bin/tf_serving_entrypoint.sh: line 3:     7 Aborted                 (core dumped) tensorflow_model_server --port=8500 --rest_api_port=8501 --model_name=${MODEL_NAME} --model_base_path=${MODEL_BASE_PATH}/${MODEL_NAME} "$@"
I monitored the memory usage and it seems like there's plenty of memory. I also increased the resource limit using Docker Desktop but still get the same error. Each request to the model is fairly small as the microservice is sending tokenized text with batch size of one. Any ideas?
I was encountering the same problem, and this fix worked for me:
I uninstalled and reinstalled tensorflow, tensorflow-gpu, etc. at version 2.9.0 (and retrained and rebuilt my model), then did a docker pull and docker run of tensorflow/serving:2.8.0. That did the trick and finally got rid of the problem.
Had the same error when using tensorflow/serving:latest. Based on Hanafi's response, I used tensorflow/serving:2.8.0 and it worked.
For reference, I used
sudo docker run -p 8501:8501 \
  --mount type=bind,source=[PATH_TO_MODEL_DIRECTORY],target=/models/[MODEL_NAME] \
  -e MODEL_NAME=[MODEL_NAME] -t tensorflow/serving:2.8.0
The issue is solved in TensorFlow and TensorFlow Serving 2.11 (not yet released), and the fix is included in the nightly release of TF Serving. You can build the nightly Docker image or use the pre-compiled version.
TensorFlow 2.9 and 2.10 were also patched to fix this issue. Refer to the PRs here. [1, 2]
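For example, to try the pre-compiled nightly image (a sketch; the model path and name are placeholders, and it assumes the nightly tag published on Docker Hub):

docker pull tensorflow/serving:nightly
docker run -p 8501:8501 \
  --mount type=bind,source=[PATH_TO_MODEL_DIRECTORY],target=/models/[MODEL_NAME] \
  -e MODEL_NAME=[MODEL_NAME] -t tensorflow/serving:nightly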

How to select a gpu with minimum gpu-memory of 20GB in qsub/PBS (for tensorflow2.0)?

One node of our cluster has several GPUs, some of which are already in use by someone else. I am submitting a job using qsub that runs a jupyter-notebook on one GPU.
#!/bin/sh
#PBS -N jupyter_gpu
#PBS -q long
##PBS -j oe
#PBS -m bae
#PBS -l nodes=anodeX:ppn=16:gpus=3
jupyter-notebook --port=111 --ip=anodeX
However, I find that qsub hands out the GPU that is already in use (the available memory shown is pretty low), so my code fails with an out-of-memory error. If I ask for more GPUs (say 3), the code runs fine only if GPU:0 has sufficient memory. I am struggling to understand what is happening.
Is there a way to request GPU memory in qsub?
Note that #PBS -l mem=20gb requests only CPU memory. I am using tensorflow 2.9.1.
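Stock qsub has no portable way to request a minimum amount of free GPU memory; whether a custom resource for it exists depends on how your site configured the scheduler. A common workaround (a sketch, assuming nvidia-smi is available on the node and that TensorFlow respects CUDA_VISIBLE_DEVICES) is to pick the least-used GPU at job start:

#!/bin/sh
#PBS -N jupyter_gpu
#PBS -q long
#PBS -l nodes=anodeX:ppn=16:gpus=3
# Pick the GPU index with the most free memory and hide the others,
# so the notebook only ever sees the least-used device.
FREE_GPU=$(nvidia-smi --query-gpu=index,memory.free --format=csv,noheader,nounits \
           | sort -t, -k2 -nr | head -n1 | cut -d, -f1)
export CUDA_VISIBLE_DEVICES=$FREE_GPU
jupyter-notebook --port=111 --ip=anodeX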

Tensorflow Serving Compiling Failure For CPU AVX AVX2

I used the method in the official TFX documentation to compile the TFX devel image in a Dockerfile. The OS is macOS with an Intel CPU.
Here is the Docker build script:
#!/bin/bash
USER=$1
TAG=$2
TF_SERVING_VERSION_GIT_BRANCH="2.4.1"
git clone --branch="${TF_SERVING_VERSION_GIT_BRANCH}" https://github.com/tensorflow/serving
TF_SERVING_BUILD_OPTIONS="--copt=-mavx --local_ram_resources=4096"
cd serving && \
docker build --pull -t $USER/tensorflow-serving-devel:$TAG \
--build-arg TF_SERVING_VERSION_GIT_BRANCH="${TF_SERVING_VERSION_GIT_BRANCH}" \
--build-arg TF_SERVING_BUILD_OPTIONS="${TF_SERVING_BUILD_OPTIONS}" \
-f tensorflow_serving/tools/docker/Dockerfile.devel .
Then I ran the shell script (the build took more than 3 hours) and it failed.
I cannot tell the exact cause, because the log output from Docker is clipped by the builder.
Has anyone met a similar problem and can help with this topic?
Thanks a lot in advance!
These instruction sets are not available on all machines, especially with older processors.
If you'd like to apply generally recommended optimizations, including utilizing platform-specific instruction sets for your processor, you can add --config=nativeopt to Bazel build commands when building TensorFlow Serving.
tools/run_in_docker.sh bazel build --config=nativeopt tensorflow_serving/...
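In the Docker-based build from the question, the same option can be passed through TF_SERVING_BUILD_OPTIONS, which Dockerfile.devel forwards to Bazel (a sketch reusing the variables from the build script above):

# Let Bazel pick the optimizations for the build machine's CPU
TF_SERVING_BUILD_OPTIONS="--config=nativeopt --local_ram_resources=4096"
docker build --pull -t $USER/tensorflow-serving-devel:$TAG \
  --build-arg TF_SERVING_VERSION_GIT_BRANCH="${TF_SERVING_VERSION_GIT_BRANCH}" \
  --build-arg TF_SERVING_BUILD_OPTIONS="${TF_SERVING_BUILD_OPTIONS}" \
  -f tensorflow_serving/tools/docker/Dockerfile.devel .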

YOLO darknet freezes when I start training my model

I am training a model with YOLO darknet in Google Colab, but when I start the training, the page freezes and a pop-up appears saying the web page is not responding.
I don't know if it is because the model has many classes to train and the page collapses.
here is my code:
!apt-get update
# Unpack the darknet sources stored on Drive into /content
!unzip "/content/drive/My Drive/custom_dib_model/darknet.zip"
!sudo apt install dos2unix
# Normalize Windows line endings so the build and scripts run on Linux
!find . -type f -print0 | xargs -0 dos2unix
!chmod +x /content/darknet
!make
# Sanity-check the build with a stock COCO detection
!./darknet detector test cfg/coco.data cfg/yolov4.cfg yolov4.weights data/person.jpg
# Keep training checkpoints on Drive instead of the local backup folder
!rm /content/darknet/backup -r
!ln -s /content/drive/'My Drive'/dib_weights/backup /content/darknet
!./darknet detector train dibujos_dataset/dib.data dib_yolov4.cfg yolov4.conv.137 -map -dont_show
The last line is the one that begins the training of my model; about 5 minutes pass before the page freezes. It should be noted that no error appears.
I found a similar question, but it has no concrete answer, only a possible answer from the user who asked it.
This is all the information that I can give you, and I hope it is enough.
I solved it by editing the YOLO configuration file dib_yolo.cfg, changing subdivisions from 64 to 16.
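In Colab that edit can be applied in place before training, e.g. (a hypothetical one-liner against the cfg used in the training command; it assumes the file sets subdivisions=64 on its own line):

# Lower subdivisions from 64 to 16 in the model config
!sed -i 's/^subdivisions=64/subdivisions=16/' dib_yolov4.cfg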

How to use TensorBoard in a Docker container (on Windows)

I have installed TensorFlow on Windows through Docker Toolbox. Everything goes well except that I can't use TensorBoard. The command line shows 'Starting Tensorboard 29 on port 6006. You can navigate to http://localhost:6006/'. However, when I open this address in my web browser, it just cannot connect. Does anyone know how to solve this problem?
If you're running TensorBoard inside a Docker container, and trying to use a web browser in Windows to view it, you will need to set up port forwarding from the container to your Windows machine. See this answer for a longer discussion about port forwarding for TensorBoard, but you should be able to make progress by using the following command:
docker run -p 0.0.0.0:6006:6006 -it b.gcr.io/tensorflow/tensorflow
However, it may be easier to install TensorFlow directly on Windows, and run TensorBoard there. If you install Python 3.5 for Windows, you can install TensorFlow and TensorBoard by running:
pip install tensorflow
You can then run TensorBoard directly from the command prompt, and you will not need to worry about port forwarding. See the Windows installation instructions for more details.
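For example, from a Windows command prompt (the log directory path is illustrative):

pip install tensorflow
tensorboard --logdir C:\path\to\logs

Then browse to http://localhost:6006/ on the same machine; no port forwarding is involved.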
I'd like to update the answer here, since I just ran into the same problem on Ubuntu 20.04 and the latest-gpu tensorflow docker image (03e706e09b04).
What worked for me was the following docker run:
docker run -p 8888:8888 -p 6006:6006 -it --rm -v <path_to_summaries>:/opt/summaries tensorflow/tensorflow bash
And then from inside the container:
tensorboard --logdir /opt/summaries/ --bind_all
The server is then accessible at localhost:6006 as one would expect.
The main difference here is, I guess, adding the --bind_all flag to the tensorboard call which exposes the server to external networks, thus allowing the host machine access.
Maybe you should map a volume to the folder with the logs and enter the container with bash:
docker run -v //c/pathto/tf_logs:/tf_logs \
  -p 0.0.0.0:6006:6006 -p 8888:8888 -it b.gcr.io/tensorflow/tensorflow bash
cd ..
tensorboard --logdir tf_logs/
Then hit the mapped address in your browser:
http://192.168.99.100:6006
(192.168.99.100 is the default Docker Toolbox VM IP, which is why localhost alone does not work in this setup.)
On Windows 10 + WSL2 + Docker using the official tensorflow/tensorflow:latest-gpu-py3-jupyter image, I had to tell TensorBoard to bind to the wildcard address. That is, in the Jupyter notebook, I called:
%tensorboard --logdir logs/ --host 0.0.0.0
After this, I was able to see the embedded dashboard in my notebook.
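For completeness, the %tensorboard magic must be loaded once per notebook before use; a minimal cell sequence (assuming event files under logs/) is:

%load_ext tensorboard
%tensorboard --logdir logs/ --host 0.0.0.0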