Is it possible to launch distributed TensorFlow on a local machine, in such a way that each worker has a replica of the model?
If yes, is it possible to assign each worker to use only a single CPU core?
Yes, it is possible to launch distributed TensorFlow locally:
Each task typically runs on a different machine, but you can run multiple tasks on the same machine (e.g. to control different GPU devices).
and in such a way that each worker has the same graph:
If you are using more than one graph (created with tf.Graph()) in the same process, you will have to use different sessions for each graph, but each graph can be used in multiple sessions.
As mentioned in your comments, there is a suggestion of how to try to achieve execution of distributed TF on a single core, which involves distributing to different CPUs and then limiting each task's thread pool to a single thread.
Currently there is no feature that allows distributed execution of TF graphs on particular cores.
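For illustration, a local two-worker setup with each task's thread pools limited to a single thread might look roughly like the sketch below; the ports and job names are arbitrary, and note this limits threads per task rather than pinning specific cores.
import tensorflow as tf  # TF 1.x API, as in the how-to referenced above

# Hypothetical local cluster: one parameter server and two workers, all on localhost.
cluster = tf.train.ClusterSpec({
    "ps": ["localhost:2222"],
    "worker": ["localhost:2223", "localhost:2224"],
})

# Limiting each task's thread pools to one thread is the closest approximation to
# "one core per worker"; actual core placement is left to the OS scheduler.
config = tf.ConfigProto(intra_op_parallelism_threads=1,
                        inter_op_parallelism_threads=1)

# Start, e.g., worker 0 in this process; the other tasks run in their own processes.
server = tf.train.Server(cluster, job_name="worker", task_index=0, config=config)
server.join()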
To your first question, the answer is yes. More details here: https://www.tensorflow.org/versions/r0.9/how_tos/distributed/index.html
For the second question, I'm not sure whether TensorFlow has this level of fine-grained, per-core control; but in general the OS will load-balance threads across multiple cores.
Note that TensorFlow does let you specify a device at the processor level, if you have multiple CPUs/GPUs.
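For example, a minimal sketch of device-level placement (this pins ops to a CPU device, not to a specific core):
import tensorflow as tf

# Place these ops on the first CPU device; this is device-level, not core-level, control.
with tf.device("/cpu:0"):
    a = tf.constant([1.0, 2.0])
    b = tf.constant([3.0, 4.0])
    c = a + b

with tf.Session() as sess:
    print(sess.run(c))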
I would like to implement a mirrored strategy using CPUs, but I don't know how to frame the parameters to pass to MirroredStrategy(). This is the line of code as it is for GPUs: distribution = tf.contrib.distribute.MultiworkerMirroredStrategy(["/device:GPU:0", "/device:GPU:1", "/device:GPU:2"])
I could change "/device:GPU:0" to "/device:CPU:0", but that seems to only use one core. Or does it? How would I check?
TensorFlow can make use of multiple CPU cores out of the box, so you do not need to use a strategy in this case. MultiworkerMirroredStrategy is only needed if you want to train with multiple machines. Those machines can each have GPU(s) or be CPU only.
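A minimal sketch of one way to check or cap this yourself, using the TF 1.x ConfigProto (matching the tf.contrib code above); a value of 0 means "let TensorFlow decide":
import tensorflow as tf

# Watch htop/top while this runs: with the defaults (0), a large matmul should
# spread across several cores; set both values to 1 to force a single thread.
config = tf.ConfigProto(intra_op_parallelism_threads=0,
                        inter_op_parallelism_threads=0)

a = tf.random_normal([4000, 4000])
b = tf.random_normal([4000, 4000])
c = tf.matmul(a, b)

with tf.Session(config=config) as sess:
    sess.run(c)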
I am running multiple Python processes (4 in this case, using the multiprocessing module) for person detection (using the SSD MobileNet model), each with its own OpenVINO inference engine. I am getting a very low FPS (not more than 10) for each process. My suspicion is that the CPUs are not being utilized optimally, because each engine spawns a high number of threads, which adds overhead, and because the processes share CPUs.
Also, for a single process, I get up to 60 FPS with OMP_NUM_THREADS set to 4.
My CPU details are:-
2 Sockets
4 cores each socket
1 thread each core
Total - 8 CPUs
So, what would be the optimal value for OMP_NUM_THREADS in this case?
And how can I avoid sharing of CPUs across processes?
Currently I am playing with the OMP_NUM_THREADS and KMP_AFFINITY variables, but I am just setting their values by trial and error. Any detail on how to set them would be really helpful, see the sketch below for the kind of thing I am experimenting with. Thanks
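(The core groupings and affinity string below are just my current trial, not something I know to be right; os.sched_setaffinity is Linux-only, and the environment variables have to be set before the OpenMP runtime loads.)
import os
from multiprocessing import Process

def run_detector(core_ids):
    # Thread budget and affinity for this process; must happen before OpenMP is initialized.
    os.environ["OMP_NUM_THREADS"] = str(len(core_ids))
    os.environ["KMP_AFFINITY"] = "granularity=fine,compact"
    os.sched_setaffinity(0, core_ids)  # pin this process to its own set of cores
    # ... create the OpenVINO inference engine and run detection here ...

if __name__ == "__main__":
    # 8 CPUs split into 4 non-overlapping pairs, one pair per process.
    for cores in ({0, 1}, {2, 3}, {4, 5}, {6, 7}):
        Process(target=run_detector, args=(cores,)).start()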
For inference with multiple networks, you may try setting OMP_WAIT_POLICY to PASSIVE.
BTW, OpenVINO 2019 R1 moved from OpenMP to TBB. It might give better efficiency for a deep learning network pipeline.
If you are using the same model for all the processes, consider using OpenVINO multi-stream inference: you load a single network and then create multiple infer requests from it. This gives better CPU utilization (compared to running one infer request across multiple cores) and, as a result, better throughput.
To understand how to use multi-stream inference, take a look at the inference_engine/samples/python_samples/benchmark_app/benchmark sample.
You can also use the benchmark sample to do a grid search for an optimal configuration (number of streams, batch size).
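A minimal sketch of that multi-stream setup with the Python API; the file names are placeholders and the exact API names may differ between OpenVINO releases:
from openvino.inference_engine import IECore, IENetwork

ie = IECore()
net = IENetwork(model="ssd_mobilenet.xml", weights="ssd_mobilenet.bin")

# One loaded network, several CPU throughput streams, and several infer requests,
# so multiple frames can be in flight at once within a single process.
exec_net = ie.load_network(network=net,
                           device_name="CPU",
                           config={"CPU_THROUGHPUT_STREAMS": "4"},
                           num_requests=4)

input_name = next(iter(net.inputs))
# exec_net.requests[i].async_infer({input_name: frame_i})
# exec_net.requests[i].wait()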
I've heard there are performance problems with "in-graph replication". What are they? Why do they exist?
Within-graph replication uses a single client process to drive execution, so this can become a bottleneck during replicated training.
Suppose you do model parallelism with 50 independent workers using a single client.
This client would need to construct the computation graph for all replicas. So if you have 50 replicas, this client has to handle a 50x larger graph.
This client would need to drive session.run calls for all replicas. If you are using the Python client, you could have 50 threads in your client issuing session.run calls concurrently, but Python has performance issues with threads (the GIL and thread scheduling issues).
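A rough sketch of what in-graph replication looks like, to show where the single client becomes the bottleneck; the job names and the toy loss are purely illustrative:
import tensorflow as tf  # TF 1.x, matching the discussion above

num_replicas = 50  # illustrative
losses = []
for i in range(num_replicas):
    # One client builds the ops for every replica inside a single graph.
    with tf.device("/job:worker/task:%d" % i):
        x = tf.random_normal([128, 10])
        w = tf.Variable(tf.zeros([10, 1]), name="w_%d" % i)
        losses.append(tf.reduce_sum(tf.matmul(x, w)))

# The same single client then drives all session.run calls, e.g.:
# sess = tf.Session("grpc://localhost:2222")
# sess.run(losses)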
I took "Distributed TensorFlow" how-to and tried to apply it to the "MNIST For ML Beginners" tutorial. I started three TensorFlow worker nodes locally (there are 8 cores in the PC) and ran the training script with replacing this line:
sess = tf.InteractiveSession()
with the following:
sess = tf.InteractiveSession("grpc://localhost:12345")
where 12345 is the port where node 0 is listening (i.e. the master session is created on node 0). Note that I did not specify explicitly where computations should be performed.
Looking at htop's output, I can see that the job is indeed performed by the cluster: it consumes some CPU. However, the only consumer is node 0; the remaining nodes do not perform any work. If I select node 1 as the place to create the master session, the picture changes: only ~2/3 of the work is performed on node 0 (judging by CPU load), and the remaining 1/3 is performed on node 1. If I select node 2 as master, then that 1/3 of the work is performed on node 2. If I run two processes in parallel, one using node 1 as master and another using node 2 as master, both nodes 1 and 2 get some load, but node 0 is loaded much more (like, 200% vs 60% vs 60% of CPU).
So far it looks like the "default" behavior of distributed TensorFlow is not great for parallelizing work automatically right now. I'm wondering what the intended behavior is, and whether distributed TensorFlow is meant for data parallelization at all (as opposed to manual model parallelization)?
TF is great for data parallelization, e.g. when you need to sift through tons of data, which is then distributed to multiple GPUs.
It's also great for weights parallelization. Using tf.train.replica_device_setter, weights are distributed among multiple devices for better IO.
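For example, a minimal sketch of replica_device_setter placing variables on a parameter-server job (the addresses and variable shapes are placeholders):
import tensorflow as tf  # TF 1.x

cluster = tf.train.ClusterSpec({
    "ps": ["localhost:2222"],
    "worker": ["localhost:2223", "localhost:2224"],
})

# Variables go round-robin onto the ps tasks; compute ops stay on the local worker.
with tf.device(tf.train.replica_device_setter(cluster=cluster,
                                              worker_device="/job:worker/task:0")):
    w = tf.Variable(tf.zeros([784, 10]), name="weights")
    b = tf.Variable(tf.zeros([10]), name="bias")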
Now, it seems you are asking for parallelization within a single model. That's difficult to do automatically, since TF does not know the best way to distribute the computation of a single model across multiple devices. It would depend on too many factors, e.g. how fast the connection between your devices is.
I'm looking into ways to improve the latency and/or throughput of a TensorFlow Serving instance. I've seen the "Serving Inception" manual and three GitHub issues (2, 3, 4), but all of them seem to create a separate instance of TensorFlow Serving per server and then choose the server on the client. Issue 4 is actually about adding a load balancer in front of that, which is currently absent from TensorFlow Serving itself.
However, there is also the "Distributed TensorFlow" tutorial, which shows how to join a set of machines into a fixed cluster and then manually "pin" some computations to some machines, which can improve both latency and throughput if the model is "wide" and can be parallelized well. However, I do not see any mention of combining this with TensorFlow Serving in either documentation.
Question is: is it possible to configure TensorFlow Serving to use distributed TensorFlow cluster?
I was able to make it create and use gRPC sessions (instead of local) with some hacks:
Make tensorflow/core/distributed_runtime/rpc:grpc_session target publicly visible (it's internal to tensorflow package by default) by modifying BUILD file.
Add it as a dependency to the tensorflow_serving/model_servers:tensorflow_model_server target.
Add an extra flag to tensorflow_model_server called --session_target which sets up session_bundle_config.session_target() in main.cc.
Run the binary with --session_target=grpc://localhost:12345, where localhost:12345 is an arbitrary node which will be used to create master sessions.
See my cluster performing some computations on behalf of TensorFlow Serving.
However, this set of hacks does not look sufficient for "real-world usage", for three reasons:
grpc_session target is probably internal for a reason.
As noticed in my other question, distributed TensorFlow works better when computations are manually "pinned" to specific machines. So, if we use TensorFlow Serving, we need a way to save those "pins", and the model's structure becomes tied to the cluster's structure. I'm not sure whether this information is exported with Exporter/Saver at all.
tensorflow_model_server creates the session once, during bootstrap. If the master node of the cluster goes down and then comes back up, the serving server still holds the "old" session and cannot process further requests.
All in all, it looks like this scenario is not officially supported yet, but I'm not sure.
If your model fits on a single machine, then it's hard to see how distributing it over many machines will improve throughput. Essentially you are taking computations which can be done independently and adding a dependency. If one of your machines is slow or crashes, instead of making some queries slow, it will make all queries slow.
That said, it's worth benchmarking to see if it helps, in which case it would make sense to ask for this use-case to be officially supported.
Regarding your questions:
Worker assignments are done through the device field in the graph .pbtxt. Some importers/exporters clear those assignments and have a clear_devices flag. You could open the graph definition (the .pbtxt file, or equivalently str(tf.get_default_graph().as_graph_def())) and grep for device strings to check, as in the snippet below.
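A quick way to do that check from Python, assuming a TF 1.x graph is loaded in the current process:
import tensorflow as tf

# Print every node that carries an explicit device assignment.
graph_def = tf.get_default_graph().as_graph_def()
for node in graph_def.node:
    if node.device:
        print(node.name, "->", node.device)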
If any worker restarts, or there's a temporary network connectivity issue, your sess.run call fails with an error (Unavailable) and you need to recreate the session. This is handled automatically by MonitoredTrainingSession in tf.train, but with Serving you need to handle it yourself.
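A hedged sketch of what handling that yourself might look like on the serving side; the helper and retry policy are made up for illustration:
import tensorflow as tf

def run_with_retry(target, fetches, feed_dict=None, max_retries=3):
    # If the master went away, build a fresh session against the same target and retry.
    for _ in range(max_retries):
        try:
            with tf.Session(target) as sess:
                return sess.run(fetches, feed_dict=feed_dict)
        except tf.errors.UnavailableError:
            continue
    raise RuntimeError("master still unavailable after %d retries" % max_retries)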
If your model is not using images, or is not extremely large, you shouldn't need much compute for each inference/serve. I'm saying this having used Inception-v#, which takes ~1 sec to serve a response to an image on a Google Cloud Platform n1-standard-1 machine.
Now, that being said, perhaps it's the throughput that you need to scale up, and that is a different problem. Your best option for scale at that point would be to use Docker Swarm & Compose, or Kubernetes, to help scale up and serve your inference "micro-service". You could also use Flask to iterate over a sequence of requests if your use case warrants it.