How does TensorFlow cluster distribute load across machines if not specified explicitly? - tensorflow

I took the "Distributed TensorFlow" how-to and tried to apply it to the "MNIST For ML Beginners" tutorial. I started three TensorFlow worker nodes locally (the PC has 8 cores) and ran the training script, replacing this line:
sess = tf.InteractiveSession()
with the following:
sess = tf.InteractiveSession("grpc://localhost:12345")
where 12345 is the port on which node 0 is listening (i.e. the master session is created on node 0). Note that I did not specify explicitly where computations should be performed.
Looking at htop's output, I can see that the job is indeed performed by the cluster: it consumes some CPU. However, the only consumer is node 0; the remaining nodes do not perform any work. If I select node 1 as the place to create the master session, the picture changes: only ~2/3 of the work is performed on node 0 (judging by CPU load), while the remaining 1/3 is performed on node 1. If I select node 2 as master, that 1/3 of the work is performed on node 2. If I run two processes in parallel, one using node 1 as master and another using node 2 as master, both nodes 1 and 2 get some load, but node 0 is loaded much more (roughly 200% vs. 60% vs. 60% of a CPU).
So far it looks like the "default" behavior of distributed TensorFlow is not great for parallelizing work automatically right now. I'm wondering what the intended behavior is, and whether distributed TensorFlow is meant for data parallelization at all (as opposed to manual model parallelization)?

TF is great for data parallelization, e.g. when you need to sift through tons of data that is then distributed to multiple GPUs.
It's also great for weights parallelization: using tf.train.replica_device_setter, weights are distributed among multiple devices for better I/O.
Now, it seems you are asking for parallelization within a single model. That's difficult to do automatically, since TF does not know the best way to distribute the computation of one model across multiple devices; it depends on too many factors, e.g. how fast the connection between your devices is.
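To illustrate the weights-parallelization point: tf.train.replica_device_setter places each newly created variable on the parameter-server tasks in round-robin order. A minimal stdlib-only sketch of that placement policy (the device strings mirror TF's naming; the function and variable names here are illustrative, not TF API):

```python
# Sketch of the round-robin variable placement that
# tf.train.replica_device_setter performs for /job:ps tasks.
from itertools import cycle

def make_placer(num_ps_tasks):
    """Return a function assigning each new variable to the next ps device."""
    tasks = cycle(range(num_ps_tasks))
    return lambda var_name: (var_name, "/job:ps/task:%d" % next(tasks))

placer = make_placer(2)
# Variables alternate between the two ps tasks: w1->task:0, b1->task:1, ...
placements = [placer(name) for name in ("w1", "b1", "w2", "b2")]
```

With two parameter servers, consecutive variables land on alternating tasks, which is what spreads the I/O load.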

Related

Distribution of MPI-threaded program on available GPUs, using slurm

My program consists of two parts, A and B, both written in C++. B is loaded from a separate DLL and can run either on the CPU or on the GPU, depending on how it is linked. When the main program is launched, it creates one instance of A, which in turn creates one instance of B (which then works either on the locally available CPUs or on the first GPU).
When launching the program using mpirun (or via Slurm, which in turn launches mpirun), one instance of A is created for each MPI rank, which in turn creates one instance of B for itself. When only one GPU is in the system, that GPU will be used, but what happens if there are multiple GPUs in the system? Are the instances of B all placed on the same GPU, regardless of whether several GPUs are available, or are they distributed evenly?
Is there any way to influence that behavior? Unfortunately, my development machine does not have multiple GPUs, so I cannot test this except in production.
Slurm supports and understands binding MPI ranks to GPUs through, for example, the --gpu-bind option: https://slurm.schedmd.com/gres.html. Assuming the cluster is correctly configured to enforce GPU affinities, this will allow you to assign one GPU per rank even if there are multiple ranks on a single node.
If you want to test this, you could use, for example, the cudaGetDevice and cudaGetDeviceProperties calls to get the device LUID (locally unique ID) for each rank and then check that there is no duplication of LUIDs within a node.
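If the cluster does not enforce affinities for you, a common fallback is to pick a GPU per rank yourself before any CUDA context is created. A sketch, assuming Slurm exports the per-node local rank as SLURM_LOCALID (it does for srun-launched tasks) and that you simply take the local rank modulo the GPU count:

```python
# Sketch: map each MPI rank to one GPU using Slurm's per-node local rank.
# Assumes SLURM_LOCALID is set by Slurm; falls back to 0 outside Slurm.
import os

def pick_gpu(num_gpus_per_node):
    """Choose a GPU index for this rank: local rank modulo GPU count."""
    local_rank = int(os.environ.get("SLURM_LOCALID", "0"))
    return local_rank % num_gpus_per_node

# With 4 ranks per node and 2 GPUs, local ranks 0,1,2,3 map to GPUs 0,1,0,1.
gpu = pick_gpu(2)
# Restrict CUDA to that one device before any CUDA context is created:
os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu)
```

Because CUDA_VISIBLE_DEVICES is read at context creation, this must run before the DLL (part B) initializes CUDA.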

OpenVINO unable to get optimum performance while running multiple inference engines

I am running multiple Python processes (4 in this case, using the multiprocessing module) for person detection (using an SSD MobileNet model), each with its own OpenVINO inference engine. I am getting a very low FPS (not more than 10) for each process. My suspicion is that the CPUs are not being utilized optimally because the number of threads spawned by each engine is high, which adds overhead, as does the sharing of CPUs across processes.
Also, for a single process I get up to 60 FPS with OMP_NUM_THREADS set to 4.
My CPU details are:
2 Sockets
4 cores each socket
1 thread each core
Total - 8 CPUs
So:
What is the optimal value for OMP_NUM_THREADS in this case?
How can I avoid sharing of CPUs across processes?
Currently I am playing with the OMP_NUM_THREADS and KMP_AFFINITY variables, but I'm just setting values by trial and error. Any detail on how to set them would be really helpful. Thanks.
In the case of multiple-network inference, you may try setting OMP_WAIT_POLICY to PASSIVE.
BTW, OpenVINO 2019 R1 moved from OpenMP to TBB. It might give better efficiency for a pipeline of deep learning networks.
If you are using the same model for all processes, consider using OpenVINO multi-stream inference. With it you load a single network and then create multiple infer requests. This gives better CPU utilization (compared to running one infer request across multiple cores) and, as a result, better throughput.
To understand how to use multi-stream inference, take a look at the inference_engine/samples/python_samples/benchmark_app/benchmark sample.
You can also use the benchmark sample to do a grid search for an optimal configuration (number of streams, batch size).
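If you stay with separate processes, the usual starting point is to split the cores evenly: 8 cores / 4 processes gives OMP_NUM_THREADS=2 per process, with each process pinned to a disjoint core set so they stop competing. A stdlib-only sketch of that plan (the plan_affinity helper is hypothetical; the KMP_AFFINITY proclist syntax follows Intel's OpenMP runtime, but verify the exact value against its documentation):

```python
# Sketch: divide 8 physical cores evenly among 4 inference processes,
# so each process gets OMP_NUM_THREADS=2 and a disjoint core set.
import os

def plan_affinity(total_cores, num_processes):
    """Build per-process env settings that partition the cores."""
    threads = total_cores // num_processes
    plans = []
    for rank in range(num_processes):
        first = rank * threads
        cores = list(range(first, first + threads))
        plans.append({
            "OMP_NUM_THREADS": str(threads),
            # Intel OpenMP explicit pinning; value shape is illustrative.
            "KMP_AFFINITY": "granularity=fine,proclist=[%s],explicit"
                            % ",".join(map(str, cores)),
        })
    return plans

plans = plan_affinity(8, 4)  # process 0 -> cores 0,1; process 3 -> cores 6,7
```

Each dict would be merged into the environment of one worker process before it loads the inference engine.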

How to profile distributed TensorFlow?

I have a simple distributed setup in TF (each step roughly corresponds to a tf.Session.run call):
Parent makes changes to variables on the device '/job:parent' via a session with a target for the sole parent task.
Several child processes tf.assign the parent variables to local variables sitting on '/job:child/task:<child index>'.
Each child process then does a bunch of calculations with its own local variables on its own TF device.
The children all have their own master sessions, connected to their personal child task targets.
My setup has either a single GPU or multiple (2) GPUs that are sharded among the parent and children. There are many more children than GPUs (memory use and compute density are small enough that a GPU cannot be fully utilized by a single child task).
Even with many children, I still find that GPU utilization is low in both the single- and multi-GPU cases. I suspect the assign operations. These are local GPUs, so it should be possible to execute the assign op via DMA, a vanilla GPU copy (if on the same GPU), or NCCL, but I worry that by default TF transfers the data over local gRPC sockets.
I have several questions:
Is my data transfer understanding correct?
The TF profiler seems directed at single-process workflows, measuring when ops get executed, etc., within a single session. How can I combine logs from multiple remote sessions? Will tf.RunMetadata capture the gRPC latency?
If I were to run all tf.assign ops on the parent session, doing every copy for each child index, would this avoid the gRPC call?
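On combining logs from multiple sessions: one workable approach (not an official TF feature) is to dump a Chrome-trace timeline per session and merge the JSON files afterwards, re-tagging the process IDs so events from different workers do not collide in the viewer. A stdlib-only sketch of the merge step, assuming each input is a blob in Chrome trace-event format like the one tf.python.client.timeline produces:

```python
import json

def merge_chrome_traces(trace_strings):
    """Merge Chrome trace-event JSON blobs from several sessions,
    prefixing 'pid' so each session shows up as a separate process row."""
    merged = []
    for offset, blob in enumerate(trace_strings):
        for ev in json.loads(blob)["traceEvents"]:
            ev = dict(ev)
            ev["pid"] = "worker%d/%s" % (offset, ev.get("pid", 0))
            merged.append(ev)
    return json.dumps({"traceEvents": merged})

# Two toy single-event traces standing in for per-worker timeline dumps.
t1 = json.dumps({"traceEvents": [{"name": "assign", "pid": 0, "ts": 1}]})
t2 = json.dumps({"traceEvents": [{"name": "matmul", "pid": 0, "ts": 2}]})
combined = merge_chrome_traces([t1, t2])
```

The merged file loads in chrome://tracing as one timeline; note this only aligns well if the workers' clocks are reasonably synchronized.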

Is it possible to use TensorFlow Serving with distributed TensorFlow cluster to improve throughput/latency?

I'm looking into ways to improve the latency and/or throughput of a TensorFlow Serving instance. I've seen the "Serving Inception" manual and three GitHub issues (2, 3, 4), but all of them seem to create a separate instance of TensorFlow Serving per server and then choose a server on the client. Issue 4 is actually about adding a load balancer in front of all that, which is currently absent from TensorFlow Serving itself.
However, there is also the "Distributed TensorFlow" tutorial, which shows how to join a set of machines into a fixed cluster and then manually "pin" some computations to some machines, which can improve both latency and throughput if the model is "wide" and can be parallelized well. However, I do not see any mention of combining this with TensorFlow Serving in either documentation.
The question is: is it possible to configure TensorFlow Serving to use a distributed TensorFlow cluster?
I was able to make it create and use gRPC sessions (instead of local) with some hacks:
Make tensorflow/core/distributed_runtime/rpc:grpc_session target publicly visible (it's internal to tensorflow package by default) by modifying BUILD file.
Add it as a dependency to the tensorflow_serving/model_servers:tensorflow_model_server target.
Add an extra flag to tensorflow_model_server called --session_target which sets up session_bundle_config.session_target() in main.cc.
Run the binary with --session_target=grpc://localhost:12345, where localhost:12345 is an arbitrary node which will be used to create master sessions.
See my cluster performing some computations on behalf of TensorFlow Serving.
However, this set of hacks does not look enough for "real-world usage" for three reasons:
grpc_session target is probably internal for a reason.
As noticed in my other question, distributed TensorFlow works better when computations are manually "pinned" to specific machines. So if we use TensorFlow Serving, we need a way to save those "pins", and the model's structure becomes tied to the cluster's structure. I'm not sure whether this information is exported with Exporter/Saver at all.
tensorflow_model_server creates the session once, during bootstrap. If the master node of the cluster goes down and then comes back, the serving server still holds the "old" session and cannot process further requests.
All in all, it looks like this scenario is not officially supported yet, but I'm not sure.
If your model fits on a single machine, then it's hard to see how distributing it over many machines will improve throughput. Essentially you are taking computations which can be done independently and adding a dependency: if one of your machines is slow or crashes, instead of making some queries slow, it will make all queries slow.
That said, it's worth benchmarking to see if it helps, in which case it would make sense to ask for this use-case to be officially supported.
Regarding questions:
Worker assignments are done through the device field in the graph .pbtxt. Some importers/exporters clear those assignments and have a clear_devices flag. You could open the graph definition (the .pbtxt file or, equivalently, str(tf.get_default_graph().as_graph_def())) and grep for device strings to check.
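That grep step can be sketched in a few lines. A stdlib-only example that scans a GraphDef text proto for device strings (the sample fragment below is a toy, not a real exported graph):

```python
# Sketch: collect the device assignments recorded in a graph .pbtxt.
import re

def list_devices(graph_pbtxt):
    """Return the sorted, de-duplicated device strings found in the text."""
    return sorted(set(re.findall(r'device:\s*"([^"]*)"', graph_pbtxt)))

# Toy fragment in the shape of a GraphDef text proto.
sample = '''
node { name: "w" op: "VariableV2" device: "/job:ps/task:0" }
node { name: "mul" op: "MatMul" device: "/job:worker/task:1" }
'''
devices = list_devices(sample)
```

An empty result would mean the devices were cleared on export (e.g. by a clear_devices flag), so the serving cluster would have to re-pin them.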
If any worker restarts, or there is a temporary loss of network connectivity, your sess.run fails with an Unavailable error and you need to recreate the session. This is handled automatically by MonitoredTrainingSession in tf.train, but with serving you need to handle it yourself.
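The handle-it-yourself part boils down to a retry loop that rebuilds the session when a step fails as Unavailable. A stdlib-only sketch with a toy harness; UnavailableError here is a hypothetical stand-in for the gRPC Unavailable error, and make_session/step stand in for your session factory and sess.run call:

```python
import time

class UnavailableError(Exception):
    """Stand-in for the Unavailable error a failed sess.run raises."""

def run_with_retry(make_session, step, max_retries=3, backoff_s=0.0):
    """Run one step; on Unavailable, rebuild the session and retry."""
    sess = make_session()
    for attempt in range(max_retries + 1):
        try:
            return step(sess)
        except UnavailableError:
            if attempt == max_retries:
                raise
            time.sleep(backoff_s)
            sess = make_session()  # master restarted: build a fresh session

# Toy harness: the first run fails, the retry succeeds.
calls = {"n": 0}
def fake_step(sess):
    calls["n"] += 1
    if calls["n"] == 1:
        raise UnavailableError()
    return "ok"

result = run_with_retry(lambda: object(), fake_step)
```

In a real serving binary the same loop would wrap the request handler, with a non-zero backoff to give the restarted master time to come up.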
If your model does not use images, or is not extremely large, you shouldn't need much compute for each inference/serve. I'm saying this having used Inception-v#, which takes ~1 sec to serve a response to an image on a Google Cloud Platform n1-standard-1 machine.
Now that being said, perhaps it's the throughput that you need to scale up, and that is a different problem. Your best option for scale at that point would be to use Docker Swarm & Compose, as well as Kubernetes, to help scale up and serve your inference "micro-service". You could also use Flask to iterate over a sequence of requests, if your use case warrants it.

multiple workers on a single machine in distributed TensorFlow

Is it possible to launch distributed TensorFlow on a local machine, in a way that each worker has a replica of the model?
If yes, is it possible to assign each agent to utilize only a single CPU core?
Yes, it is possible to launch distributed TensorFlow locally:
Each task typically runs on a different machine, but you can run multiple tasks on the same machine (e.g. to control different GPU devices).
and in such a way that each worker has the same graph:
If you are using more than one graph (created with tf.Graph()) in the same process, you will have to use different sessions for each graph, but each graph can be used in multiple sessions.
As mentioned in the comments, there is a suggestion on how to try to achieve execution of distributed TF on a single core, which involves distributing to different CPUs and then limiting the thread pool to a single thread.
Currently there is no feature that allows the distributed execution of TF graphs to particular cores.
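For the first part (several workers on one machine), the cluster spec simply lists multiple tasks on localhost with different ports; each task is then launched with its own job name and task index. A sketch of that shape (the port numbers are illustrative; the device_for helper is hypothetical, but the device-string format matches TF's):

```python
# Shape of the cluster spec you'd pass to tf.train.ClusterSpec when
# running several worker tasks on one machine (ports are illustrative).
cluster = {
    "ps": ["localhost:2222"],
    "worker": ["localhost:2223", "localhost:2224", "localhost:2225"],
}
# Each process then starts its own server, e.g.:
#   tf.train.Server(tf.train.ClusterSpec(cluster),
#                   job_name="worker", task_index=0)

def device_for(job, task):
    """Device string for pinning ops to a given task."""
    return "/job:%s/task:%d" % (job, task)
```

Ops wrapped in `with tf.device(device_for("worker", 1)):` would then run on the second local worker.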
To your first question, the answer is yes. More details here: https://www.tensorflow.org/versions/r0.9/how_tos/distributed/index.html
For the second question, I'm not sure whether TensorFlow has this level of fine-grained, core-level control; but in general the OS will load-balance threads across multiple cores.
Note that TensorFlow does have the ability to specify a device at the processor level, if you have multiple CPUs/GPUs.