Distributed TensorFlow hangs during CreateSession

I am new to distributed TensorFlow. Right now I am just trying to get some existing examples to work so I can learn how to do it right.
I am following the instructions here to train the Inception network on one Linux machine with one worker and one PS.
https://github.com/tensorflow/models/tree/master/research/inception#how-to-train-from-scratch-in-a-distributed-setting
The program hangs during CreateSession with the message:
CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
This is my command to start the worker:
./bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 \
--data_dir=/datasets/BigLearning/jinlianw/imagenet_tfrecords/ \
--job_name='worker' \
--task_id=0 \
--ps_hosts='localhost:2222' \
--worker_hosts='localhost:2223'
This is my command to start a PS:
./bazel-bin/inception/imagenet_distributed_train \
--job_name='ps' \
--task_id=0 \
--ps_hosts='localhost:2222' \
--worker_hosts='localhost:2223'
And the PS process hangs after printing:
2018-06-29 21:40:43.097361: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:332] Started server with target: grpc://localhost:2222
Is the inception model still a valid example for distributed TensorFlow or did I do something wrong?
Thanks!

Problem resolved. It turned out to be a gRPC issue: my cluster machines have the http_proxy environment variable set, and unsetting it solved the problem.
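For reference, here is a minimal Python sketch of what the --ps_hosts/--worker_hosts flags boil down to on the PS side, and of clearing the proxy variables before the gRPC server starts (clearing https_proxy and the upper-case variants as well is an assumption; only http_proxy was the confirmed culprit in my case):

import os
import tensorflow as tf

# gRPC honors the proxy environment variables, so clear them before the
# server is created; otherwise worker <-> PS traffic to localhost may be
# routed through the proxy and CreateSession never gets a response.
for var in ("http_proxy", "https_proxy", "HTTP_PROXY", "HTTPS_PROXY"):
    os.environ.pop(var, None)

# The same cluster the two commands above describe.
cluster = tf.train.ClusterSpec({
    "ps": ["localhost:2222"],
    "worker": ["localhost:2223"],
})

# On the PS task: start the gRPC server and block, serving variables.
server = tf.train.Server(cluster, job_name="ps", task_index=0)
server.join()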

Related

Simple cluster manager for small Tensorflow distributed training?

I'm just getting into distributed training with TensorFlow. At the moment, I run four processes on the same computer on different ports:
python trainer.py \
--model models/my_model \
--model_dir model_dir/my_model \
--train_set data/train.csv \
--val_set data/val.csv \
--cluster_spec '{
"environment": "cloud",
"cluster": {
"chief": ["localhost:2221"],
"worker": ["localhost:2222"],
"ps": ["localhost:2220"]
},
"task": {
"type": "chief",
"index": 0
}
}'
The only thing that changes between processes is the "task" object at the end of the --cluster_spec flag, whose values are specific to each process's role.
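For context, a rough sketch of how a trainer script might consume that --cluster_spec JSON (my actual trainer.py differs; cluster_spec_json is just a placeholder for the flag's value):

import json
import tensorflow as tf

# cluster_spec_json: hypothetical variable holding the --cluster_spec string.
spec = json.loads(cluster_spec_json)
cluster = tf.train.ClusterSpec(spec["cluster"])
task = spec["task"]
# Each process starts the server that matches its own "task" entry.
server = tf.train.Server(cluster, job_name=task["type"], task_index=task["index"])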
Now I'm thinking about using the three computers I have at home instead of just running separate processes on the same machine.
Question
Other than Kubernetes, what cluster management software could I use to simplify launching and watching those four processes across three different computers connected via WiFi? Ideally, this would be something very approachable for someone who's never done automated cluster management before.

RuntimeError: `RunConfig` instance is expected to be used by the `Estimator` inside the `Experiment`

Trying to run this tutorial experiment:
https://cloud.google.com/ml-engine/docs/how-tos/getting-started-training-prediction#local-train-single
Running locally, in virtualenv, Python v2.7, TensorFlow v1.2
When executing this command:
gcloud ml-engine local train \
--module-name trainer.task \
--package-path trainer/ \
-- \
--train-files $TRAIN_DATA \
--eval-files $EVAL_DATA \
--train-steps 1000 \
--job-dir $MODEL_DIR \
--eval-steps 100
I get the following error:
RuntimeError: `RunConfig` instance is expected to be used by the `Estimator` inside the `Experiment`.
expected
_cluster_spec={},
_environment=u'cloud',
_evaluation_master='',
_is_chief=True,
_master='',
_model_dir='output',
_num_ps_replicas=0,
_num_worker_replicas=0,
_task_id=0,
_task_type=None,
_tf_config=<tensorflow.core.protobuf.config_pb2.ConfigProto object at 0x111833938>,
but got
_cluster_spec={},
_environment=u'cloud',
_evaluation_master='',
_is_chief=True,
_master='',
_model_dir='output',
_num_ps_replicas=0,
_num_worker_replicas=0,
_task_id=0,
_task_type=None,
_tf_config=<tensorflow.core.protobuf.config_pb2.ConfigProto object at 0x111833c80>
So it appears the _tf_config is not being found at the expected address. I have not been able to find documentation on how to set that up. Thanks.
UPDATE:
It appears to have something to do with my virtualenv setup. It works fine when I install TensorFlow natively.
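For anyone hitting the same check: the pattern the error message asks for is that the Estimator wrapped by the Experiment uses the very same RunConfig instance. A minimal sketch with the TF 1.2 tf.contrib.learn APIs (my_model_fn, train_input_fn, and eval_input_fn are hypothetical placeholders, not the tutorial's code):

from tensorflow.contrib.learn import Estimator, Experiment, RunConfig
from tensorflow.contrib.learn.python.learn import learn_runner

def experiment_fn(output_dir):
    # Build one RunConfig and hand that same instance to the Estimator
    # the Experiment wraps, so the two configs compare equal.
    run_config = RunConfig()
    estimator = Estimator(model_fn=my_model_fn,      # hypothetical model_fn
                          model_dir=output_dir,
                          config=run_config)
    return Experiment(estimator,
                      train_input_fn=train_input_fn,  # hypothetical input fns
                      eval_input_fn=eval_input_fn)

learn_runner.run(experiment_fn, output_dir="output")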

Cannot run Mesos Containers with GPU tasks

I am running Mesos on Ubuntu and am trying to execute:
mesos-execute \
--master=$(cat /etc/mesos/zk) \
--name=gpu-test \
--docker_image=nvidia/cuda \
--command="nvidia-smi" \
--framework_capabilities="GPU_RESOURCES" \
--resources="gpus:1"
and it is failing because: sh: 1: nvidia-smi: not found
even though when I run it without container support
mesos-execute \
--master=$(cat /etc/mesos/zk) \
--name=gpu-test \
--command="nvidia-smi" \
--framework_capabilities="GPU_RESOURCES" \
--resources="gpus:1"
it has access to the GPU.
Also, if I run it without container support but use the command
nvidia-docker run -it nvidia/cuda nvidia-smi
it works. So it seems the Mesos containerizer doesn't have access to the GPUs, even though in the /etc/mesos-slave/ directory I set containerizers to mesos (along with all the other flags required to run GPU commands). Non-GPU commands work fine.
This looks like a regression in 1.3.0. I downgraded to 1.2.1 on Ubuntu and can successfully use GPUs with Docker containers and the Mesos containerizer again.
sudo apt-get install mesos=1.2.1-2.0.1
It looks like someone filed a related bug but there's been no activity:
https://issues.apache.org/jira/browse/MESOS-7730

rnn translate showing data_utils not found in google-cloud-ml-engine

I want to create a chatbot using TensorFlow. I am using the code in 'github.com/tensorflow/models/tree/master/tutorials/rnn/translate'. While running the code on google-cloud-ml-engine I get the exception '/usr/bin/python: No module named data_utils' and the job fails.
Here are the commands I used:
gcloud ml-engine jobs submit training ${JOB_NAME} \
--package-path=. \
--module-name=translate.translate \
--staging-bucket="${TRAIN_BUCKET}" \
--region=us-central1 \
-- \
--from_train_data=${INPUT_TRAIN_DATA_A} \
--to_train_data=${INPUT_TRAIN_DATA_B} \
--from_dev_data=${INPUT_TEST_DATA_A} \
--to_dev_data=${INPUT_TEST_DATA_B} \
--train_dir="${TRAIN_PATH}" \
--data_dir="${TRAIN_PATH}" \
--steps_per_checkpoint=5 \
--from_vocab_size=45000 \
--to_vocab_size=45000
ml_engine log screenshot 1
ml_engine log screenshot 2
Is this a problem with ml_engine or TensorFlow?
I followed the blog 'blog.kovalevskyi.com/how-to-train-a-chatbot-with-the-tensorflow-and-google-cloud-ml-3a5617289032' and initially used 'github.com/b0noI/models/tree/translate_tutorial_supports_google_cloud_ml/tutorials/rnn/translate'. It gave the same error.
Neither; it is actually a problem within the code you are uploading, namely satisfying local dependencies. The file data_utils.py is located in the same folder as the example you took the code from. As also mentioned in this post, you should make sure it is available to your model.
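One way to guarantee that on ML Engine is to put data_utils.py inside the package that gets uploaded and package it via setup.py. A minimal sketch, assuming the package directory is named translate (your layout and names may differ):

# Assumed layout:
#   setup.py
#   translate/__init__.py
#   translate/translate.py     # imports data_utils
#   translate/data_utils.py
#
# setup.py
from setuptools import find_packages, setup

setup(
    name="translate",
    version="0.1",
    packages=find_packages(),  # picks up translate/ so data_utils.py ships with the job
)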

Tensorflow Inception FeedInputs: unable to find feed output input

I tried the Inception tutorial on the TensorFlow site:
https://www.tensorflow.org/versions/r0.12/how_tos/image_retraining/
The bazel build completes successfully, but when I try to predict an image's class with this command:
bazel build tensorflow/examples/label_image:label_image && \
bazel-bin/tensorflow/examples/label_image/label_image \
--graph=/tmp/output_graph.pb --labels=/tmp/output_labels.txt \
--output_layer=final_result \
--image=$HOME/flower_photos/daisy/21652746_cc379e0eea_m.jpg
I have this error:
tensorflow/examples/label_image/main.cc:305] Running model failed: Not found: FeedInputs: unable to find feed output input
How can I solve this problem?
This thread helped me fix the issue.
It seems that we need to provide --input_layer with TensorFlow 1.0+.
In your case, this should fix the problem:
bazel build tensorflow/examples/label_image:label_image && \
bazel-bin/tensorflow/examples/label_image/label_image \
--graph=/tmp/output_graph.pb --labels=/tmp/output_labels.txt \
--output_layer=final_result \
--image=$HOME/flower_photos/daisy/21652746_cc379e0eea_m.jpg \
--input_layer=Mul
Are you using TensorFlow 1.0+? I had the same issue, but switching to an earlier version (I used 0.12.0) resolved it. It must be something in the 1.0.0 update that broke the tutorial.