Tensorflow network resource usage

To evaluate the quality of job placements in distributed TensorFlow, I want to obtain the total size in bytes of the data sent through the network during training. This is in preparation for further work on the automatic job placement algorithm. Network usage will measure data locality of the training, and is a proxy for training delays.
My plan is simply to record the sizes of all tensors input to _Send nodes, then output and display these in the Python profiling Timeline. I've read the related discussions here and here and believe this is correct in principle. My only concern is that my experiments have shown that Send and Recv nodes are used for communication within a process in addition to inter-process communication, which appears different from what's described in the whitepaper: https://www.tensorflow.org/about/bib.
Are there any caveats to my approach, and is this a good approximation of the actual amount of network traffic? Also, is the amount of data transferred a worthwhile quantity to minimize when the goal is reducing delays caused by job placements?
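For reference, here is a minimal sketch of the bookkeeping this would involve, assuming you can enumerate the shapes and dtypes of the tensors feeding _Send nodes (the helper names here are made up for illustration; they are not a TensorFlow API):

```python
from functools import reduce

# Per-element sizes in bytes for some common TensorFlow dtypes.
DTYPE_BYTES = {"float32": 4, "float16": 2, "int32": 4, "int64": 8}

def tensor_bytes(shape, dtype):
    """Approximate size of a dense tensor: element count times element size."""
    elements = reduce(lambda a, b: a * b, shape, 1)
    return elements * DTYPE_BYTES[dtype]

def total_send_bytes(send_inputs):
    """Sum sizes over all (shape, dtype) pairs feeding _Send nodes.

    Caveat from the question: this over-counts true network traffic,
    because Send/Recv pairs also carry transfers within one process.
    """
    return sum(tensor_bytes(shape, dtype) for shape, dtype in send_inputs)

# Example: a 128x1024 float32 activation plus a 1024-long float16 vector.
print(total_send_bytes([((128, 1024), "float32"), ((1024,), "float16")]))  # 526336
```

To approximate network usage rather than total Send traffic, you would additionally filter out Send/Recv pairs whose endpoints live on the same host.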

Related

tensorflow how to reduce high "device-to-device" load

I profiled a model that I am running and the vast majority of the time in each step (295 of 320ms) is being taken up by "device-to-device" operations (see image). I assume this means loading data from my cpu onto my gpu and back is the bottleneck.
I am running this on a single machine. The data is stored on an SSD and being fed into a GPU.
I am using tensorflow's tf.data.Dataset API and doing all the recommended things like prefetching and num_parallel_calls=tf.data.experimental.AUTOTUNE
My questions are:
(1) Is my assumption correct?
(2) How do I reduce this huge burden on my model?
Tensorboard Profiling Overview
Not a proper answer, but it's something: by using TensorFlow's mixed-precision training I was able to reduce the "device-to-device" time to ~145ms. This is still an immense burden compared to everything else profiled, and I'd love to be able to reduce it further.
I don't know why this helped either. I assume that mixed-precision training means fewer bytes are being passed around, so maybe that helps.
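That intuition is easy to check with back-of-the-envelope arithmetic: under mixed precision, tensors cast to float16 take half the bytes of float32, so any copy of them between devices moves half the data (the batch shape below is invented for illustration):

```python
def transfer_bytes(batch, height, width, channels, bytes_per_element):
    # Dense-tensor size: number of elements times bytes per element.
    return batch * height * width * channels * bytes_per_element

fp32 = transfer_bytes(32, 224, 224, 3, 4)  # float32: 4 bytes per element
fp16 = transfer_bytes(32, 224, 224, 3, 2)  # float16: 2 bytes per element
print(fp32, fp16, fp32 / fp16)  # 19267584 9633792 2.0
```

Halving the bytes does not automatically halve the transfer time (latency and other overheads stay fixed), which may be why the observed reduction was from 295ms to ~145ms rather than exactly half of a larger number.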

When is TensorFlow's ParameterServerStrategy preferable to its MultiWorkerMirroredStrategy?

When training a neural network across multiple servers and GPUs, I can't think of a scenario where the ParameterServerStrategy would be preferable to the MultiWorkerMirroredStrategy.
What are the ParameterServerStrategy's main use cases and why would it be better than using MultiWorkerMirroredStrategy?
MultiWorkerMirroredStrategy is intended for synchronous distributed training across multiple workers, each of which can have multiple GPUs
ParameterServerStrategy: Supports parameter servers. It can be used for multi-GPU synchronous local training or asynchronous multi-machine training.
One of the key differences is that ParameterServerStrategy can be used for asynchronous training, while MultiWorkerMirroredStrategy is intended for synchronous distributed training. In MultiWorkerMirroredStrategy, a copy of all variables in the model is kept on each device across all workers, and a communication method is needed to keep all variables in sync. In contrast, in ParameterServerStrategy each variable of the model is placed on one parameter server.
This matters because:
In synchronous training, all the workers are kept in sync in terms of training epochs and steps; if a worker fails or is preempted, the other workers must wait for it to restart before they can continue. If the failed or preempted worker does not restart for some reason, your workers will keep waiting.
In contrast, in ParameterServerStrategy each worker runs the same code independently, while the parameter servers run a standard server. This means that while each worker synchronously computes a single gradient update across its own GPUs, updates between workers proceed asynchronously. Operations that occur only on the first replica (such as incrementing the global step) will occur on the first replica of every worker. Hence, unlike MultiWorkerMirroredStrategy, different workers do not wait on each other.
I guess the question is: do you expect workers to fail, and will the delay in restarting them slow down training when using MultiWorkerMirroredStrategy? If that is the case, maybe ParameterServerStrategy is better.
EDIT: Answers to questions in comments:
So is the only benefit of PSS the fact that it resists better to
failing workers than MWMS?
Not exactly - even if workers do not fail in MWMS, since workers still need to stay in sync there could be network bottlenecks.
If so, then I imagine it would only be useful when training on many
workers, say 20 or more, or else the probability that a worker will
fail during training is low (and it can be avoided by saving regular
snapshots).
Maybe not, it depends on the situation. Perhaps in your scenario the probability of failure is low; in someone else's scenario it may be higher. For the same number of workers, the longer a job runs, the more likely it is that a failure occurs in the middle of it. To illustrate further (with an oversimplified example): if I have the same number of nodes but they're simply slower, they could take much longer to finish a job, and hence there is a greater likelihood of some kind of interruption or failure occurring during the job.
(and it can be avoided by saving regular snapshots).
Not sure I understand what you mean - if a worker fails and you've saved a snapshot, then you haven't lost data. But the worker still needs to restart, and in the interim between failure and restart the other workers may be waiting.
Isn't there a possible benefit with I/O saturation? If the updates are
asynchronous, I/O would be more spread out in time, right? But maybe
this benefit is cancelled by the fact that it uses more I/O? Could you
please detail this a bit?
I will first try to answer it from a conceptual point of view.
I would say try looking at it from a different angle - in a synchronous operation, you're waiting for something else to finish, and you may be idle till that something gives you what you need.
In contrast, in an asynchronous operation you do your own work, and when you need more you ask for it.
There is no hard and fast rule about whether synchronous operations or asynchronous operations are better. It depends on the situation.
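A toy step-time model makes the difference concrete (all numbers invented): with synchronous updates, every global step waits for the slowest worker, while with asynchronous updates each worker completes steps at its own pace:

```python
def sync_updates(step_times, wall_clock):
    # One synchronized update per global step; the slowest worker sets the pace.
    return int(wall_clock // max(step_times))

def async_updates(step_times, wall_clock):
    # Each worker pushes its own updates independently of the others.
    return sum(int(wall_clock // t) for t in step_times)

workers = [1.0, 1.0, 1.0, 3.0]  # seconds per step; one straggler
print(sync_updates(workers, 300))   # 100: everyone waits for the 3.0s worker
print(async_updates(workers, 300))  # 1000: fast workers are never idle
```

Of course, one synchronized update aggregates gradients from all workers, so the two counts are not directly comparable in quality, only in how often the variables move and how much time is spent idle.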
I will now try to answer it from an optimization point of view:
Isn't there a possible benefit with I/O saturation? If the updates are
asynchronous, I/O would be more spread out in time, right? But maybe
this benefit is cancelled by the fact that it uses more I/O? Could you
please detail this a bit?
In a distributed system, your bottleneck could be CPU/GPU, disk, or network. Nowadays networks are really fast, in some cases faster than disk. Depending on your workers' configuration, CPU/GPU could be the bottleneck. So it really depends on the configuration of your hardware and network.
Hence I would do some performance testing to determine where the bottlenecks in your system are, and optimize for your specific problem.
EDIT: Additional follow up questions:
One last thing: in your experience, in what use cases is PSS used? I
mean, both PSS and MWMS are obviously for use with large datasets (or
else a single machine would suffice), but what about the model? Would
PSS be better for larger models? And in your experience, is MWMS more
frequently used?
I think cost and the type of problem being worked on may influence the choice. For example, both AWS and GCP offer “spot instances” / “preemptible instances”, which are heavily discounted servers that can be taken away at any moment. In such a scenario, it may make sense to use PSS - even though machine failure is unlikely, an instance may simply be taken away without notice because it is a “spot instance”. If you use PSS, then the performance impact of servers disappearing may not be as large as when using MWMS.
If you’re using dedicated instances, the instances are dedicated to you, and will not be taken away - the only risk of interruption is machine failure. In such cases MWMS may be more attractive if you can take advantage of performance optimisations or plugin architecture.

what is a "convolution warmup"?

I encountered this phrase a few times before, mostly in the context of neural networks and TensorFlow, but I get the impression it's something more general and not restricted to these environments.
Here, for example, they say that this "convolution warmup" process takes about 10k iterations.
Why do convolutions need to warm up? What prevents them from reaching their top speed right away?
One thing that I can think of is memory allocation. If so, I would expect it to be solved after 1 (or at least fewer than 10) iterations. Why 10k?
Edit for clarification: I understand that the warmup is a time period or number of iterations that have to pass until the convolution operator reaches its top speed (time per operation).
What I'm asking is: why is it needed, and what happens during this time that makes the convolution faster?
Training neural networks works by offering training data, calculating the output error, and backpropagating the error back to the individual connections. For symmetry breaking, the training doesn't start with all zeros, but by random connection strengths.
It turns out that with the random initialization, the first training iterations aren't really effective. The network isn't anywhere near to the desired behavior, so the errors calculated are large. Backpropagating these large errors would lead to overshoot.
A warmup phase is intended to get the initial network away from a random network, and towards a first approximation of the desired network. Once the approximation has been achieved, the learning rate can be accelerated.
This is an empirical result. The number of iterations will depend on the complexity of your program domain, and therefore also with the complexity of the necessary network. Convolutional neural networks are fairly complex, so warmup is more important for them.
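If the phrase does refer to this learning-rate warmup, it is commonly implemented as a linear ramp over the first N steps (the numbers here are illustrative, not taken from the question):

```python
def warmup_lr(step, base_lr=0.1, warmup_steps=10_000):
    """Linearly ramp the learning rate from 0 to base_lr, then hold it."""
    if step < warmup_steps:
        return base_lr * (step / warmup_steps)
    return base_lr

print(warmup_lr(0))       # 0.0: start with (nearly) no updates
print(warmup_lr(5_000))   # 0.05: halfway through the ramp
print(warmup_lr(20_000))  # 0.1: full learning rate after warmup
```

Note, though, that this kind of warmup affects training quality, not the wall-clock speed of the convolution op itself; the answer below about cuDNN autotuning addresses the per-operation timing question directly.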
You are not alone in observing that the time per iteration varies.
I ran the same example and had the same question. I can say the main reason is the varying input image shape and number of objects to detect.
I offer my test results for discussion.
I enabled tracing and captured the timeline first; I found that the number of Conv2D occurrences varied between steps in the GPU compute stream. I then used export TF_CUDNN_USE_AUTOTUNE=0 to disable autotuning.
After that, the timeline showed the same number of Conv2D ops in each step, each taking about 0.4s.
The times are still different, but much closer!
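For anyone reproducing this: TF_CUDNN_USE_AUTOTUNE is read when cuDNN is initialized, so it must either be exported in the shell before launching the script, or set at the very top of the script before TensorFlow is imported:

```python
import os

# Must run before `import tensorflow` so cuDNN sees the flag at initialization.
os.environ["TF_CUDNN_USE_AUTOTUNE"] = "0"

print(os.environ["TF_CUDNN_USE_AUTOTUNE"])  # 0
```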

Performance improvement using Between-Graph replication in distributed tensorflow

I have gone through this
answer, but it didn't give the rationale for choosing multiple clients in Between-Graph replication for improving performance. How will using Between-Graph replication improve performance, when compared to In-Graph replication?
In-graph replication works fine for multiple devices on the same machine, but it doesn't scale well to cluster-size, because one client has to take care of coordination between all devices (even those located on different nodes).
Say, for example, that you have two GPUs, one on the client's machine and another on a second machine. Thanks to Tensorflow's magic, a simple with tf.device('address_of_the_gpu_on_the_other_machine'): will place operations on the remote computer's GPU. The graph will then run on both machines, but data will then need to be gathered from both before being able to proceed in the computation (loss computation, etc). Network communication will slow down your training (and of course, the more machines, the more communication needed).
Between-graph replication, on the other hand, scales much better because each machine has its own client that only needs to coordinate communication to the parameter server and execution of its own operations. Graphs "overlap" on the parameter server, which updates one set of variables that are shared among all the worker graphs. Moreover, communication overhead is also greatly reduced, because now you only need to have fast communication to the parameter servers, but no machine needs to wait for other machines to complete before moving on to the next training iteration.
How are the graphs different between the two methods?
In-graph replication:
In this method, you have only one graph, managed by the client. This graph has nodes that are spread over multiple devices, even across different machines. This means that, for example, having two machines PC1 and PC2 on a network, the client will explicitly dispatch operations to one machine or the other. The graph technically is not "replicated"; only some parts of it are distributed. Typically, the client has a big batch of data that is split into sub-batches, each of which is fed to a compute-intensive part of the graph. Only this compute-intensive part is replicated; everything before the split and after the computation (e.g., the loss calculation) runs on the client. This is a bottleneck.
Note, also, that it's the client that decides which operations go to which machine, so theoretically one could have different parts of the graph on different nodes. You could decide to replicate the compute-intensive part identically on all your nodes, or you could, in principle, say "all the convolutions go to PC1, all dense layers go to PC2". Tensorflow's magic will insert data transfers where appropriate to make things work for you.
Between-graph replication:
Here you have multiple similar copies of the same graph. Why similar? Because all of them have the compute-intensive part (as above), but also the input pipeline, the loss calculation, and their own optimizer (assuming you're using asynchronous training, the default; synchronous training is another layer of complexity that I'll leave aside). Delving deeper into Tensorflow's distributed framework, you'll also find that not all workers (and their graphs) are equal: there is one "chief" worker that does initialization, checkpointing, and summary logging, but this is not critical to understanding the general idea.
Unlike above, here you need a special machine, the parameter server (PS), that acts as a central repository for the graph's variables (caveat: not all the variables, only the global ones, like global_step and the weights of your network). You need this because at each training step every worker fetches the most recent values of the variables, then sends the PS the updates that must be applied, and the PS actually performs the update.
How is this different from the method above?
For one thing, there is no "big batch" that gets split among workers. Every worker processes as much data as it can handle; there is no need for splitting and putting things back together afterwards. This means there is no need for synchronization of workers, because the training loops are entirely independent. The training, however, is not independent, because the updates that worker A makes to the variables will be seen by worker B, since they share the same variables. This means that the more workers you have, the faster the training (subject to diminishing returns), because effectively the variables are updated more often (approximately every time_for_a_train_loop/number_of_workers seconds). Again, this happens without coordination between workers, which incidentally also makes the training more robust: if a worker dies, the others can continue (with some caveats due to having a chief worker).
One last cool feature of this method is that, in principle, there is no loss in performance using a heterogeneous cluster. Every machine runs as fast as it can and awaits nobody. Should you try running in-graph replication on a heterogeneous cluster, you'd be limited in speed by the slowest machine (because you collect all results before continuing).
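The heterogeneous-cluster point can be illustrated with a toy throughput model (rates invented): in-graph replication gathers all results each step, so the slowest machine sets the pace for everyone, while between-graph workers contribute updates at their own rates:

```python
def in_graph_throughput(rates):
    # Results are gathered from every machine each step, so the slowest
    # machine limits all of them: len(rates) sub-batches per slowest step.
    return min(rates) * len(rates)

def between_graph_throughput(rates):
    # Each worker trains independently; per-machine rates simply add up.
    return sum(rates)

rates = [10.0, 10.0, 2.0]  # batches per second; one slow machine
print(in_graph_throughput(rates))       # 6.0
print(between_graph_throughput(rates))  # 22.0
```

In the in-graph case, adding the slow machine actually hurts: the two fast machines alone would process 20 batches per second, but once the 2.0-rate machine joins the synchronized step, total throughput drops to 6.0.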

Deep neural network diverges after convergence

I implemented the A3C network in https://arxiv.org/abs/1602.01783 in TensorFlow.
At this point I'm 90% sure the algorithm is implemented correctly. However, the network diverges after convergence. See the attached image that I got from a toy example where the maximum episode reward is 7.
When it diverges, the policy network starts assigning a single action very high probability (>0.9) for most states.
What should I check for this kind of problem? Is there any reference for it?
Note that in Figure 1 of the original paper the authors say:
For asynchronous methods we average over the best 5
models from 50 experiments.
That can mean that in a lot of cases the algorithm does not work that well. From my experience, A3C often diverges, even after convergence. Careful learning-rate scheduling can help. Or do what the authors did: train several agents with different seeds and pick the one performing best on your validation data. You could also employ early stopping when the validation error begins to increase.
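The early-stopping suggestion can be sketched as a small patience-based tracker over periodic validation scores (a generic pattern, not specific to A3C or any codebase):

```python
class EarlyStopping:
    """Stop when the validation reward has not improved for `patience` checks."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("-inf")
        self.bad_checks = 0

    def should_stop(self, val_reward):
        if val_reward > self.best:
            self.best = val_reward
            self.bad_checks = 0
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience

stopper = EarlyStopping(patience=3)
rewards = [1.0, 3.0, 6.5, 7.0, 6.9, 6.2, 5.0]  # converges near 7, then diverges
stops = [stopper.should_stop(r) for r in rewards]
print(stops)  # [False, False, False, False, False, False, True]
```

Stopping at that point would keep the checkpoint from around the reward-7.0 peak instead of letting the run degrade, matching the divergence-after-convergence curve described in the question.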