How to run TensorFlow 2 in a distributed environment with Horovod?

I have successfully set up the distributed environment and run the Horovod example. I also know that to run the TensorFlow 1 benchmark in a distributed setup, e.g. on 4 nodes, following the tutorial, the submission should be:
$ horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 \
python scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
--model resnet101 \
--batch_size 64 \
--variable_update horovod \
--data_dir /path/to/imagenet/tfrecords \
--data_name imagenet \
--num_batches=2000
But now I want to run the TensorFlow 2 official models, for example the BERT model. What command should I use?
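There is no drop-in tf_cnn_benchmarks-style command here: the TF2 official models are driven by tf.distribute strategies, not by a --variable_update horovod flag, so the training script itself has to integrate Horovod. Below is a minimal sketch of the standard Horovod + TF2 Keras pattern; everything BERT-specific is a placeholder, not the real model.

# Launch the same way as the TF1 benchmark above, e.g.:
#   horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python train.py
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Pin each worker process to a single local GPU.
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

# Placeholder model and data standing in for BERT fine-tuning.
model = tf.keras.Sequential([tf.keras.layers.Dense(2, activation='softmax')])
features = tf.random.uniform((256, 128))
labels = tf.random.uniform((256,), maxval=2, dtype=tf.int32)

# Scale the learning rate by the worker count and wrap the optimizer
# so gradients are averaged across all processes.
optimizer = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(1e-3 * hvd.size()))
model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy')

model.fit(
    features, labels, batch_size=32, epochs=1,
    # Broadcast rank 0's initial weights so every worker starts identically.
    callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
    verbose=1 if hvd.rank() == 0 else 0)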

Related

Use non-quantized deeplab model on coral board

I use a quantized model of DeepLab v3 on the Coral dev board. The result is not very precise, so I want to use a non-quantized model instead.
I can't find a non-quantized tflite model of DeepLab, so I want to generate it myself.
I downloaded the xception65_coco_voc_trainval model from the TensorFlow GitHub:
xception65_coco_voc_trainval
Then I use this command to transform the graph:
bazel-bin/tensorflow/tools/graph_transforms/transform_graph \
--in_graph="/path/to/xception65_coco_voc_trainval.pb" \
--out_graph="/path/to/xception65_coco_voc_trainval_flatten.pb" \
--inputs='ImageTensor' \
--outputs='SemanticPredictions' \
--transforms='
strip_unused_nodes(type=quint8, shape="1,513,513,3")
flatten_atrous_conv
fold_constants(ignore_errors=true, clear_output_shapes=false)
fold_batch_norms
fold_old_batch_norms
remove_device
sort_by_execution_order'
Then I generate the tflite file with this command:
tflite_convert \
--graph_def_file="/tmp/deeplab_mobilenet_v2_opt_flatten_static.pb" \
--output_file="/tmp/deeplab_mobilenet_v2_opt_flatten_static.tflite" \
--output_format=TFLITE \
--input_shape=1,513,513,3 \
--input_arrays="ImageTensor" \
--inference_type=FLOAT \
--inference_input_type=QUANTIZED_UINT8 \
--std_dev_values=128 \
--mean_values=128 \
--change_concat_input_ranges=true \
--output_arrays="SemanticPredictions" \
--allow_custom_ops
This command generates a tflite file. I download the file to my Coral dev board and try to run it.
I use this example from GitHub to try my DeepLab model on Coral:
deeplab on coral
When I start the program there is an error:
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Aborted
The error comes from this line:
engine = BasicEngine(args.model)
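One way to cross-check the CLI flags is to reproduce the conversion with the TF 1.x Python converter API; a minimal sketch, assuming the same paths and tensor names as the commands above:

import tensorflow as tf  # TF 1.x API via tf.compat.v1

# Mirrors the tflite_convert flags above: float inference with a
# quantized uint8 input fed through (mean, std_dev) = (128, 128).
converter = tf.compat.v1.lite.TFLiteConverter.from_frozen_graph(
    graph_def_file='/tmp/deeplab_mobilenet_v2_opt_flatten_static.pb',
    input_arrays=['ImageTensor'],
    output_arrays=['SemanticPredictions'],
    input_shapes={'ImageTensor': [1, 513, 513, 3]})
converter.inference_type = tf.float32
converter.inference_input_type = tf.uint8   # QUANTIZED_UINT8
converter.quantized_input_stats = {'ImageTensor': (128.0, 128.0)}  # (mean, std_dev)
converter.change_concat_input_ranges = True
converter.allow_custom_ops = True

with open('/tmp/deeplab_mobilenet_v2_opt_flatten_static.tflite', 'wb') as f:
    f.write(converter.convert())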

freeze model for inference with output_node_name for ssd mobilenet v1 coco

I want to compile a TensorFlow graph to a Movidius graph. I used the Model Zoo's ssd_mobilenet_v1_coco model and trained it on my own dataset.
Then I ran
python object_detection/export_inference_graph.py \
--input_type=image_tensor \
--pipeline_config_path=/home/redtwo/nsir/ssd_mobilenet_v1_coco.config \
--trained_checkpoint_prefix=/home/redtwo/nsir/train/model.ckpt-3362 \
--output_directory=/home/redtwo/nsir/output
which generates frozen_inference_graph.pb and saved_model/saved_model.pb.
Now I want to convert this saved model into a Movidius graph. The commands given are:
Export GraphDef file
python3 ../models/research/slim/export_inference_graph.py \
--alsologtostderr \
--model_name=inception_v3 \
--output_file=inception_v3.pb
Freeze model for inference
python3 ../tensorflow/tensorflow/python/tools/freeze_graph.py \
--input_graph=inception_v3.pb \
--input_binary=true \
--input_checkpoint=inception_v3.ckpt \
--output_graph=inception_v3_frozen.pb \
--output_node_name=InceptionV3/Predictions/Reshape_1
which can finally be fed to the Intel Movidius NCS SDK:
mvNCCompile -s 12 inception_v3_frozen.pb -in=input -on=InceptionV3/Predictions/Reshape_1
All of this is given on the Intel Movidius website: https://movidius.github.io/ncsdk/tf_modelzoo.html
My model is already trained, i.e. output/frozen_inference_graph.pb. Why do I need to freeze it again using slim/export_inference_graph.py? Or is it output/saved_model/saved_model.pb that should go as input to slim/export_inference_graph.py?
All I want is the output_node_name (e.g. InceptionV3/Predictions/Reshape_1). How do I get this output_node_name? It looks like a directory structure, and I don't know what it contains.
What output node should I use for the Model Zoo's ssd_mobilenet_v1_coco model (trained on my own custom dataset)?
python freeze_graph.py \
--input_graph=/path/to/graph.pbtxt \
--input_checkpoint=/path/to/model.ckpt-22480 \
--input_binary=false \
--output_graph=/path/to/frozen_graph.pb \
--output_node_names="the nodes that you want to output e.g. InceptionV3/Predictions/Reshape_1 for Inception V3 "
Things I understand & don't understand:
input_checkpoint: ✓ [checkpoints that were created during training]
output_graph: ✓ [path to the output frozen graph]
output_node_names: ✗
I don't understand the output_node_names parameter and what should go inside it, considering it's ssd_mobilenet, not inception_v3.
System information
What is the top-level directory of the model you are using:
Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
TensorFlow installed from (source or binary): TensorFlow installed with pip
TensorFlow version (use command below): 1.13.1
Bazel version (if compiling from source):
CUDA/cuDNN version: V10.1.168/7.*
GPU model and memory: 2080Ti 11Gb
Exact command to reproduce:
The graph in saved_model/saved_model.pb is the graph definition (graph architecture) of the pretrained inception_v3 model, without the weights loaded into the graph. The frozen_inference_graph.pb is the graph frozen with the checkpoints you have provided, taking the default output nodes of the inception_v3 model.
To get the output node names, the summarize_graph tool can be used.
You can use the commands below to run the summarize_graph tool if Bazel is installed:
bazel build tensorflow/tools/graph_transforms:summarize_graph
bazel-bin/tensorflow/tools/graph_transforms/summarize_graph \
--in_graph=/tmp/inception_v3_inf_graph.pb
If Bazel is not installed, the output nodes can be obtained using TensorBoard or any other graph visualization tool, such as Netron.
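If neither Bazel nor a visualizer is at hand, candidate output nodes can also be listed with a short script; a minimal sketch (the path reuses the example above) that prints every node no other node consumes:

import tensorflow as tf

# Load the GraphDef and print every node that no other node consumes;
# these are the candidate output nodes.
graph_def = tf.compat.v1.GraphDef()
with open('/tmp/inception_v3_inf_graph.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())

consumed = set()
for node in graph_def.node:
    for inp in node.input:
        # Strip control-dependency markers ("^name") and port suffixes (":0").
        consumed.add(inp.split(':')[0].lstrip('^'))

for node in graph_def.node:
    if node.name not in consumed:
        print(node.op, node.name)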
The additional freeze_graph.py can be used to freeze the graph while specifying the output nodes (i.e. in a case where additional output nodes have been added to InceptionV3). The frozen_inference_graph.pb is an equally good fit for inferencing.
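As for the SSD-specific part of the question: graphs exported by the Object Detection API's export_inference_graph.py expose four output nodes: detection_boxes, detection_scores, detection_classes, and num_detections. A sketch of the corresponding call through freeze_graph's Python API, keeping the question's placeholder paths:

from tensorflow.python.tools import freeze_graph

# The four output names are the Object Detection API's standard SSD outputs.
freeze_graph.freeze_graph(
    input_graph='/path/to/graph.pbtxt',
    input_saver='',
    input_binary=False,
    input_checkpoint='/path/to/model.ckpt-22480',
    output_node_names='detection_boxes,detection_scores,'
                      'detection_classes,num_detections',
    restore_op_name='save/restore_all',
    filename_tensor_name='save/Const:0',
    output_graph='/path/to/frozen_graph.pb',
    clear_devices=True,
    initializer_nodes='')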

Evaluation and Visualization of DeeplabV3 completed or not?

Disclaimer: this is my first time trying machine learning!
We have a requirement to automatically segment objects in an image from the background. Searching the internet, we found that DeepLab should solve our purpose. We downloaded DeepLab from its official site, followed all of the instructions mentioned there, and trained on the pascal_voc_2012 dataset with the command below:
python deeplab/train.py \
--logtostderr \
--training_number_of_steps=30000 \
--train_split="train" \
--model_variant="xception_65" \
--atrous_rates=6 \
--atrous_rates=12 \
--atrous_rates=18 \
--output_stride=16 \
--decoder_output_stride=4 \
--train_crop_size=513 \
--train_crop_size=513 \
--train_batch_size=1 \
--dataset="pascal_voc_seg" \
--tf_initial_checkpoint=/home/ktpl13/Desktop/models-master/research/deeplab/datasets/pascal_voc_seg/checkpoint \
--train_logdir=/home/ktpl13/Desktop/models-master/research/deeplab/datasets/pascal_voc_seg/exp/train_on_train_set/train \
--dataset_dir=/home/ktpl13/Desktop/models-master/research/deeplab/datasets/pascal_voc_seg/tfrecord
Training finished after 50 hours. Then I started the evaluation with the command below:
python deeplab/eval.py \
--logtostderr \
--eval_split="val" \
--model_variant="xception_65" \
--atrous_rates=6 \
--atrous_rates=12 \
--atrous_rates=18 \
--output_stride=16 \
--decoder_output_stride=4 \
--eval_crop_size=513 \
--eval_crop_size=513 \
--dataset="pascal_voc_seg" \
--checkpoint_dir=/home/ktpl13/Desktop/models-master/research/deeplab/datasets/pascal_voc_seg/exp/train_on_train_set/train/ \
--eval_logdir=/home/ktpl13/Desktop/models-master/research/deeplab/datasets/pascal_voc_seg/exp/train_on_train_set/eval/ \
--dataset_dir=/home/ktpl13/Desktop/models-master/research/deeplab/datasets/pascal_voc_seg/tfrecord
After executing the above command, it found one checkpoint correctly, but after that it stayed on this message:
"Waiting for checkpoint at
home/ktpl13/Desktop/models-master/research/deeplab/datasets/pascal_voc_seg/exp/train_on_train_set/train/"
So I terminated the eval run after 2 hours and started the visualization with the command below:
python deeplab/vis.py \
--logtostderr \
--vis_split="val" \
--model_variant="xception_65" \
--atrous_rates=6 \
--atrous_rates=12 \
--atrous_rates=18 \
--output_stride=16 \
--decoder_output_stride=4 \
--vis_crop_size=513 \
--vis_crop_size=513 \
--dataset="pascal_voc_seg" \
--checkpoint_dir=/home/ktpl13/Desktop/models-master/research/deeplab/datasets/pascal_voc_seg/exp/train_on_train_set/train/ \
--vis_logdir=/home/ktpl13/Desktop/models-master/research/deeplab/datasets/pascal_voc_seg/exp/train_on_train_set/vis/ \
--dataset_dir=/home/ktpl13/Desktop/models-master/research/deeplab/datasets/pascal_voc_seg/tfrecord/
Visualization also ran for one checkpoint and then got the same message as eval:
"Waiting for checkpoint at
home/ktpl13/Desktop/models-master/research/deeplab/datasets/pascal_voc_seg/exp/train_on_train_set/train/"
Again I terminated the vis run. A folder named "segmentation_results" was generated under vis, containing a "prediction.png" for each input image, and each prediction is a completely black image.
Now my questions are:
Did my evaluation and visualization complete, or am I doing something wrong?
Why are the predicted images all black?
About the eval waiting for another checkpoint: by default it expects to run alongside the training process. To run the eval script only once, after training, add this flag to the eval.sh script:
--max_number_of_evaluations=1
You can then view the resulting value using TensorBoard.
The vis.sh script appears to be running correctly, as it is saving the images to the directory. The all-black images are a different problem (e.g. dataset configuration, label weights, colormap removal, etc.).
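One quick sanity check on the black images (a sketch; the exact file name pattern is an example, not necessarily what vis.py produces) is to look at the raw pixel values, since a prediction that is entirely class 0 (background) renders as pure black:

import numpy as np
from PIL import Image

# Only the value 0 appearing means everything was predicted as background;
# a mix of small values means the image just renders dark in a viewer.
pred = np.array(Image.open(
    'exp/train_on_train_set/vis/segmentation_results/000000_prediction.png'))
print('shape:', pred.shape)
print('unique values:', np.unique(pred))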
For future reference, I ran into the same problem. After I found out what happened, I laughed so hard.
Both eval and vis ran as expected.
For eval, right above your output of "waiting for checkpoints", there should be a line that says "miou [your model's accuracy]". It is a tiny line and easy to miss.
For vis, you will find your segmented result in the vis logdir you provided in your vis command.
More in depth: both eval and vis successfully analyzed the network you trained, and as a feature they wait for more checkpoints, in case you decide to train more networks to compare.

Google Cloud ML: Use Nightly TF Import Error No Module tensorflow

I want to train the NMT model from Google on Google Cloud ML.
NMT Model
Now I put all input data in a bucket and downloaded the git repository.
The model needs the nightly version of TensorFlow, so I defined it in setup.py. When I use the CPU version tf-nightly==1.5.0-dev20171115 and run the following command locally on GCP, it works.
Train locally on Google:
gcloud ml-engine local train --package-path nmt/ \
--module-name nmt.nmt \
-- --src=en --tgt=de \
--hparams_path=$HPARAMAS_PATH \
--out_dir=$OUTPUT_DIR \
--vocab_prefix=$VOCAB_PREFIX \
--train_prefix=$TRAIN_PREFIX \
--dev_prefix=$DEV_PREFIX \
--test_prefix=$TEST_PREFIX
Now, when I use the GPU version with the following command, I get this error message a few minutes after submitting the job.
Train on cloud
gcloud ml-engine jobs submit training $JOB_NAME \
--runtime-version 1.2 \
--job-dir $JOB_DIR \
--package-path nmt/ \
--module-name nmt.nmt \
--scale-tier BASIC_GPU \
--region $REGION \
-- --src=en --tgt=de \
--hparams_path=$HPARAMAS_PATH \
--out_dir=$OUTPUT_DIR \
--vocab_prefix=$VOCAB_PREFIX \
--train_prefix=$TRAIN_PREFIX \
--dev_prefix=$DEV_PREFIX \
--test_prefix=$TEST_PREFIX
Error:
import tensorflow as tf
ImportError: No module named tensorflow
setup.py:
from setuptools import find_packages
from setuptools import setup
REQUIRED_PACKAGES = ['tf-nightly-gpu==1.5.0-dev20171115']
setup(
    name="nmt",
    install_requires=REQUIRED_PACKAGES,
    packages=find_packages(),
    include_package_data=True,
    version='0.1.2',
)
Thank you all in advance
Markus
Update:
I have found a note in the GCP docs:
Note: Training with TensorFlow versions 1.3+ is limited to CPUs only. See the Cloud ML Engine release notes for updates.
So it seems this doesn't work currently; I think I have to go with Compute Engine instead.
Or is there any hack to get it working?
However, thank you for your help.
TensorFlow 1.5 might need a newer version of CUDA (i.e., CUDA 9), but the version Cloud ML Engine has installed is CUDA 8. Can you please try TensorFlow 1.4 instead, which works on CUDA 8? Please tell us here if 1.4 works for you, or send us an email via cloudml-feedback@google.com.
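For reference, the suggested fix amounts to pinning a CUDA 8-compatible release in setup.py; a minimal sketch:

from setuptools import find_packages
from setuptools import setup

# Same setup.py as above, but pinned to a CUDA 8-compatible release
# (tensorflow-gpu 1.4.0 was built against CUDA 8).
setup(
    name="nmt",
    install_requires=['tensorflow-gpu==1.4.0'],
    packages=find_packages(),
    include_package_data=True,
    version='0.1.2',
)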

RuntimeError: `RunConfig` instance is expected to be used by the `Estimator` inside the `Experiment`

Trying to run this tutorial experiment:
https://cloud.google.com/ml-engine/docs/how-tos/getting-started-training-prediction#local-train-single
Running locally, in virtualenv, Python v2.7, TensorFlow v1.2
When executing this command:
gcloud ml-engine local train \
--module-name trainer.task \
--package-path trainer/ \
-- \
--train-files $TRAIN_DATA \
--eval-files $EVAL_DATA \
--train-steps 1000 \
--job-dir $MODEL_DIR \
--eval-steps 100
I get the following error:
RuntimeError: `RunConfig` instance is expected to be used by the `Estimator` inside the `Experiment`.
expected
_cluster_spec={},
_environment=u'cloud',
_evaluation_master='',
_is_chief=True,
_master='',
_model_dir='output',
_num_ps_replicas=0,
_num_worker_replicas=0,
_task_id=0,
_task_type=None,
_tf_config=<tensorflow.core.protobuf.config_pb2.ConfigProto object at 0x111833938>,
but got
_cluster_spec={},
_environment=u'cloud',
_evaluation_master='',
_is_chief=True,
_master='',
_model_dir='output',
_num_ps_replicas=0,
_num_worker_replicas=0,
_task_id=0,
_task_type=None,
_tf_config=<tensorflow.core.protobuf.config_pb2.ConfigProto object at 0x111833c80>
So the only field that differs is the _tf_config object, which sits at a different memory address, i.e. the Estimator and the Experiment ended up with two different RunConfig instances. I have not been able to find documentation on how to set this up. Thanks.
UPDATE:
This appears to have something to do with my virtualenv setup. It works fine when I install TensorFlow natively.
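For anyone who hits the same virtualenv mix-up, a quick check of which TensorFlow the environment actually resolves:

# Run inside the activated virtualenv; if the printed path points outside
# the virtualenv (or the import fails), the environment is picking up the
# wrong installation.
import tensorflow as tf
print(tf.__version__)
print(tf.__file__)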