I'm just getting into distributed training with TensorFlow. At the moment, I run 4 processes on the same computer on different ports:
python trainer.py \
    --model models/my_model \
    --model_dir model_dir/my_model \
    --train_set data/train.csv \
    --val_set data/val.csv \
    --cluster_spec '{
        "environment": "cloud",
        "cluster": {
            "chief": ["localhost:2221"],
            "worker": ["localhost:2222"],
            "ps": ["localhost:2220"]
        },
        "task": {
            "type": "chief",
            "index": 0
        }
    }'
The only thing that changes from process to process is the task object at the end of the --cluster_spec value, whose type and index identify the role each process plays.
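For reference, here is a minimal shell sketch of how the processes can be launched so that only the task type and index vary (the launch helper and the backgrounding with & are my own scaffolding, not part of TensorFlow):

launch() {  # usage: launch <task_type> <task_index>
    python trainer.py \
        --model models/my_model \
        --model_dir model_dir/my_model \
        --train_set data/train.csv \
        --val_set data/val.csv \
        --cluster_spec "{
            \"environment\": \"cloud\",
            \"cluster\": {
                \"chief\": [\"localhost:2221\"],
                \"worker\": [\"localhost:2222\"],
                \"ps\": [\"localhost:2220\"]
            },
            \"task\": {\"type\": \"$1\", \"index\": $2}
        }" &
}
launch chief 0
launch worker 0
launch ps 0
wait    # keep the shell alive until all training processes exit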
Now I'm thinking about using the three computers I have at home instead of running all of the processes on the same machine.
Question
Other than Kubernetes, what cluster management software could I use to simplify launching and watching those four processes across three different computers connected via WiFi? Ideally, this would be something very approachable for someone who's never done automated cluster management before.
Related
I have successfully set up the distributed environment and run the example with Horovod. I also know that if I want to run the TensorFlow 1 benchmark in a distributed setup, e.g. on 4 nodes, then following the tutorial the submission should be:
$ horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 \
python scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
--model resnet101 \
--batch_size 64 \
--variable_update horovod \
--data_dir /path/to/imagenet/tfrecords \
--data_name imagenet \
--num_batches=2000
But now I want to run the TensorFlow 2 official models, for example the BERT model. What command should I use?
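Note that with TensorFlow 2, Horovod is wired into the training script itself rather than selected through a --variable_update flag, so the script you launch must call Horovod's API internally. Below is a minimal sketch using Horovod's TF2 Keras API, with a toy model and dataset standing in for the official BERT code (which I have not adapted here):

import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # horovodrun starts one copy of this script per GPU slot

# Pin each process to a single local GPU.
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

# Toy stand-ins for the real model and input pipeline.
model = tf.keras.Sequential([tf.keras.layers.Dense(2, input_shape=(16,))])
data = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([256, 16]), tf.zeros([256], tf.int64))).batch(32)

# Wrap the optimizer so gradients are allreduced across all workers;
# the learning rate is conventionally scaled by the worker count.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(1e-3 * hvd.size()))
model.compile(optimizer=opt,
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

# Broadcast rank 0's initial weights so every worker starts in sync,
# and only let rank 0 print progress.
model.fit(data,
          epochs=1,
          callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
          verbose=1 if hvd.rank() == 0 else 0)

The launch command then keeps the same shape as for TF1, e.g. horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python train.py.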
I am new to distributed TensorFlow. Right now I am just trying to get some existing examples to work so I can learn how to do it right.
I am following the instructions here to train the Inception network on one Linux machine with one worker and one PS.
https://github.com/tensorflow/models/tree/master/research/inception#how-to-train-from-scratch-in-a-distributed-setting
The program hangs during CreateSession with the message:
CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
This is my command to start a worker:
./bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 \
--data_dir=/datasets/BigLearning/jinlianw/imagenet_tfrecords/ \
--job_name='worker' \
--task_id=0 \
--ps_hosts='localhost:2222' \
--worker_hosts='localhost:2223'
This is my command to start a PS:
./bazel-bin/inception/imagenet_distributed_train \
--job_name='ps' \
--task_id=0 \
--ps_hosts='localhost:2222' \
--worker_hosts='localhost:2223'
And the PS process hangs after printing:
2018-06-29 21:40:43.097361: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:332] Started server with target: grpc://localhost:2222
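That log line is printed when the script's in-process gRPC server comes up. Under the hood, the flags map to roughly the following standard TF1 setup (a sketch, not the actual inception code):

import tensorflow as tf

# --ps_hosts / --worker_hosts become a ClusterSpec;
# --job_name / --task_id select this process's slot in it.
cluster = tf.train.ClusterSpec({
    "ps": ["localhost:2222"],
    "worker": ["localhost:2223"],
})
server = tf.train.Server(cluster, job_name="ps", task_index=0)
server.join()  # a PS only serves variables, so it blocks here forever

A PS sitting silently after "Started server" is therefore expected; the real symptom is the worker stuck in CreateSession.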
Is the inception model still a valid example for distributed TensorFlow or did I do something wrong?
Thanks!
Problem resolved. It turned out to be due to gRPC: my cluster machines have the http_proxy environment variable set, and unsetting it solved the problem.
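For anyone hitting the same thing: gRPC honors the proxy environment variables, so the channels to the cluster hosts were being routed through the proxy. Something like the following before launching the processes should avoid it (the no_proxy exemption is an assumption on my part rather than something I tested):

# Drop the proxy entirely for these shells...
unset http_proxy https_proxy
# ...or keep it but exempt the cluster's own hosts
export no_proxy=localhost,127.0.0.1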
I understand that TensorFlow supports distributed training.
I found num_clones in train_image_classifier.py, which lets me use multiple GPUs locally.
python $TF_MODEL_HOME/slim/train_image_classifier.py \
--num_clones=2 \
--train_dir=${TRAIN_DIR} \
--dataset_name=imagenet \
--dataset_split_name=train \
--dataset_dir=${DATASET_DIR} \
--model_name=vgg_19 \
--batch_size=32 \
--max_number_of_steps=100
How do I use multiple GPUs on different hosts?
You need to use --worker_replicas=<number of hosts> to train on multiple hosts with the same number of GPUs. Apart from that, you have to configure --task, --num_ps_tasks, --sync_replicas, and --replicas_to_aggregate when training on multiple hosts.
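For concreteness, the worker-side command might look roughly like this (a sketch only: the port, the --master target, and the exact flag values are assumptions, and the cluster's PS/worker server processes still have to exist for the session target to resolve):

# On host 1 (repeat on host 2 with --task=1)
python $TF_MODEL_HOME/slim/train_image_classifier.py \
    --master=grpc://host1:2222 \
    --num_clones=2 \
    --worker_replicas=2 \
    --num_ps_tasks=1 \
    --task=0 \
    --sync_replicas=True \
    --replicas_to_aggregate=2 \
    --train_dir=${TRAIN_DIR} \
    --dataset_name=imagenet \
    --dataset_split_name=train \
    --dataset_dir=${DATASET_DIR} \
    --model_name=vgg_19 \
    --batch_size=32 \
    --max_number_of_steps=100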
I'd also suggest giving Horovod a try; I'm planning to try it out myself in a couple of days.
Trying to run this tutorial experiment:
https://cloud.google.com/ml-engine/docs/how-tos/getting-started-training-prediction#local-train-single
Running locally, in virtualenv, Python v2.7, TensorFlow v1.2
When executing this command:
gcloud ml-engine local train \
--module-name trainer.task \
--package-path trainer/ \
-- \
--train-files $TRAIN_DATA \
--eval-files $EVAL_DATA \
--train-steps 1000 \
--job-dir $MODEL_DIR \
--eval-steps 100
I get the following error:
RuntimeError: `RunConfig` instance is expected to be used by the `Estimator` inside the `Experiment`.
expected
_cluster_spec={},
_environment=u'cloud',
_evaluation_master='',
_is_chief=True,
_master='',
_model_dir='output',
_num_ps_replicas=0,
_num_worker_replicas=0,
_task_id=0,
_task_type=None,
_tf_config=<tensorflow.core.protobuf.config_pb2.ConfigProto object at 0x111833938>,
but got
_cluster_spec={},
_environment=u'cloud',
_evaluation_master='',
_is_chief=True,
_master='',
_model_dir='output',
_num_ps_replicas=0,
_num_worker_replicas=0,
_task_id=0,
_task_type=None,
_tf_config=<tensorflow.core.protobuf.config_pb2.ConfigProto object at 0x111833c80>
So it appears the two configs differ only in the memory address of _tf_config, i.e. the Estimator was not built with the exact RunConfig instance the Experiment expected. I have not been able to find documentation on how to set that up. Thanks.
UPDATE:
It appears to have something to do with my virtualenv setup; everything works fine when I install TensorFlow natively.
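If someone else hits this: one plausible explanation is that the virtualenv and the gcloud tooling were picking up two different TensorFlow installations, which would yield two distinct ConfigProto objects like the ones in the error above. A quick way to compare what each environment resolves to:

which python
python -c "import tensorflow as tf; print(tf.__version__, tf.__file__)"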
I want to create a chatbot using TensorFlow. I am using the code in 'github.com/tensorflow/models/tree/master/tutorials/rnn/translate'. While running the code on google-cloud-ml-engine I get the exception '/usr/bin/python: No module named data_utils' and the job fails.
Here are the commands I used:
gcloud ml-engine jobs submit training ${JOB_NAME} \
--package-path=. \
--module-name=translate.translate \
--staging-bucket="${TRAIN_BUCKET}" \
--region=us-central1 \
-- \
--from_train_data=${INPUT_TRAIN_DATA_A} \
--to_train_data=${INPUT_TRAIN_DATA_B} \
--from_dev_data=${INPUT_TEST_DATA_A} \
--to_dev_data=${INPUT_TEST_DATA_B} \
--train_dir="${TRAIN_PATH}" \
--data_dir="${TRAIN_PATH}" \
--steps_per_checkpoint=5 \
--from_vocab_size=45000 \
--to_vocab_size=45000
Is this a problem with ml-engine or with TensorFlow?
I followed the blog 'blog.kovalevskyi.com/how-to-train-a-chatbot-with-the-tensorflow-and-google-cloud-ml-3a5617289032' and initially used 'github.com/b0noI/models/tree/translate_tutorial_supports_google_cloud_ml/tutorials/rnn/translate'. It gave the same error.
Neither; it is actually a problem within the code you are uploading, namely satisfying local dependencies. The file data_utils.py is located in the same folder you got the example from. As also mentioned in this post, you should make sure it is available to your model.
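Concretely, everything the entry point imports has to live inside the directory passed as --package-path, since that is all ml-engine uploads. A sketch of a layout that works with the command above (file names assume the translate tutorial):

translate/
    __init__.py       # makes the directory a package ml-engine can build
    translate.py      # entry point, matching --module-name=translate.translate
    seq2seq_model.py
    data_utils.py     # must ship alongside translate.py, not be assumed on the VM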