How can I construct a TensorFlow session that connects to an existing set of TF servers?
Following the distributed TensorFlow guide I started a bunch of TensorFlow servers that are connected to form a cluster.
I would now like to start a session that can connect to those TF servers and assign ops to them. I assume I just need to specify an appropriate target in the constructor of tf.Session; e.g., something like
with tf.Session(
    target, config=tf.ConfigProto(log_device_placement=True)) as sess:
However, it's not clear to me how to construct a target pointing at an existing cluster of TF servers. The docs only show how to obtain the target from a server object you already hold, by calling server.target.
Do I need to start another server in order to construct a client that talks to the existing servers?
I want to connect to the TF cluster remotely. My TF servers are running on GCE VMs. I want to connect and assign ops to them from my local machine. Is this possible?
For the target you can just use a string of the form
grpc://<host>:<port>
where <host> and <port> are the hostname (or IP address) and port of one of the TF servers comprising the cluster. It doesn't matter which one you pick, because where an op executes is determined by the device you assigned it to when you constructed the graph.
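As a sketch of putting that together (TF 1.x API; under TF 2.x the same calls live under tf.compat.v1): the snippet below starts an in-process server only so it is self-contained — in the real setup, target would be the address of one of your remote TF servers, e.g. "grpc://<gce-vm-ip>:2222" (placeholder address).

```python
import tensorflow.compat.v1 as tf  # in TF 1.x, just: import tensorflow as tf

tf.disable_eager_execution()

# Self-contained stand-in for one of the remote cluster servers.
server = tf.train.Server.create_local_server()
target = server.target  # a string of the form "grpc://<host>:<port>"

c = tf.constant(42)
with tf.Session(target,
                config=tf.ConfigProto(log_device_placement=True)) as sess:
    value = sess.run(c)
    print(value)  # 42
```

Connecting from your local machine works the same way, as long as the server's port is reachable over the network.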
Related
I need to boot 100 virtual servers and use each for tensorflow model inference for 30 days. What are some tools to do this?
I currently boot the servers from an image and manually open two tmux sessions: one for the model client and the other for the TensorFlow server. I receive a Slack notification if any server's CPU stops working, as a way of knowing when a server fails (I also SSH in manually to debug/restart it).
Would appreciate any tips!
I have served several models through TensorFlow Serving.
I want to know: how can I get the whole config list through gRPC in my client?
TensorFlow Serving does not offer an API to fetch the config. Using a running server instance as the source of truth for the config is not recommended, because (1) it can go down, (2) multiple replicas can fall out of sync with one another, and (3) the read-update-write pattern is race-prone. Instead, the recommended approach is to keep the ground-truth config in some persistent store, e.g. a database or
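For illustration, the config that would live in that persistent store is the same model_config_list text proto you hand to the server via --model_config_file; the model name and base path below are made up:

```
model_config_list {
  config {
    name: "my_model"                 # hypothetical model name
    base_path: "/models/my_model"    # hypothetical export path
    model_platform: "tensorflow"
  }
}
```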
I'm learning about distributed TensorFlow applications, and I understand jobs, tasks, and servers. A server's target identifies its gRPC location, as in grpc://localhost:1234.
I don't understand what happens when you create a session with a server's target, as in the following code:
with tf.Session(server.target) as sess:
...
The documentation states that server.target identifies the session's execution engine. Another page says that the constructor creates a session on the server. This isn't clear to me. How exactly does a server's target affect the session's execution?
In distributed TensorFlow, does a parameter server need to be built as a TensorFlow server with both the master service and the worker service? If so, can a ps machine also be a worker?
In distributed TensorFlow, every worker is also a master. Furthermore, the TensorFlow runtime has only one kind of worker, unlike its predecessor DistBelief, which had specialized parameter-server workers.
You implement the traditional parameter-server architecture by using some workers to store parameters and others to execute session.run requests.
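A minimal sketch of that pattern (TF 1.x API; under TF 2.x the same calls live under tf.compat.v1): pin variables to a "ps" job and compute ops to a "worker" job via tf.device. The job/task names below are the conventional ones, and building the graph this way does not require the cluster to actually be running.

```python
import tensorflow.compat.v1 as tf  # in TF 1.x, just: import tensorflow as tf

tf.disable_eager_execution()

# Parameters live on a worker acting as the parameter server...
with tf.device("/job:ps/task:0"):
    w = tf.get_variable("w", shape=[10], initializer=tf.zeros_initializer())

# ...while computation is pinned to a compute worker.
with tf.device("/job:worker/task:0"):
    y = 2.0 * w

print(w.device)
print(y.device)
```

At session.run time, the runtime partitions the graph across those devices and inserts the send/recv ops between jobs for you.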
I installed TensorFlow 0.8 by building from source.
I'm using an AWS EC2 g2.8xlarge instance, which has 4 GPUs.
I tried to run the distributed TensorFlow MNIST test; the code is here:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/dist_test/scripts/dist_mnist_test.sh
my script:
bash dist_mnist_test.sh "grpc://localhost:2223 grpc://localhost:2224"
and I got this message:
E0609 14:53:07.430440599 62872 tcp_client_posix.c:173] failed to connect to 'ipv4:127.0.0.1:2223': socket error: connection refused
E0609 14:53:07.445297934 62873 tcp_client_posix.c:173] failed to connect to 'ipv4:127.0.0.1:2224': socket error: connection refused
Does anyone know what is wrong here? Thanks a lot!
This script does not run standalone. In particular, it expects that you have created a TensorFlow cluster with workers running at each of the addresses before running the script. The create_tf_cluster.sh script can set up such a cluster using Kubernetes. The dist_test.sh script runs these scripts end-to-end.
See my answer to your other question, which has a suggested script for running MNIST on distributed TensorFlow.
I suspect there's a networking problem here. The first debugging step I would take is to check that ports 2223 and 2224 actually have processes listening on them, using a tool like netstat. Here's a good general description of how to do that:
https://askubuntu.com/questions/278448/how-to-know-what-program-is-listening-on-a-given-port
If that works, then try using telnet to connect to the socket manually, to make sure the network addressing is working correctly.
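That check can also be scripted; here is a small sketch (plain Python, no TF needed) of a helper that reports whether anything is accepting TCP connections on a given host/port, demonstrated against a listener the script opens itself:

```python
import socket

def port_is_listening(host, port, timeout=1.0):
    """Return True if something accepts TCP connections at host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # connection refused, timeout, unreachable, ...
        return False

# Demo against a listener opened by this script itself:
srv = socket.socket()
srv.bind(("127.0.0.1", 0))      # port 0 -> the OS picks a free port
srv.listen(1)
port = srv.getsockname()[1]
while_open = port_is_listening("127.0.0.1", port)
srv.close()
after_close = port_is_listening("127.0.0.1", port)
print(while_open, after_close)  # True False
```

Running port_is_listening against each of your TF server addresses (e.g. localhost:2223 and localhost:2224 above) distinguishes "server never started" from a genuine addressing problem.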