What is gradient repacking in TensorFlow?

When running the TensorFlow benchmarks from a terminal, there are a couple of parameters we can specify. One of them is called gradient_repacking. What does it represent, and how would one think about setting it?
python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=256 \
--model=resnet50 --optimizer=momentum --variable_update=replicated \
--nodistortions --gradient_repacking=8 --num_gpus=8 \
--num_epochs=90 --weight_decay=1e-4 --data_dir=${DATA_DIR} --use_fp16 \
--train_dir=${CKPT_DIR}

For those searching in the future, gradient_repacking affects all-reduce in replicated mode. From the flags definition:
flags.DEFINE_integer('gradient_repacking', 0, 'Use gradient repacking. It'
                     'currently only works with replicated mode. At the end of'
                     'of each step, it repacks the gradients for more efficient'
                     'cross-device transportation. A non-zero value specifies'
                     'the number of split packs that will be formed.',
                     lower_bound=0)
As for the optimal value, I've seen gradient_repacking=8 as you have, and also gradient_repacking=2.
My best guess is that the parameter refers to the number of shards the gradients get broken into for sharing among the other workers. Eight in this case (for your num_gpus=8) would seem to mean each GPU shares with every other GPU (i.e. all-to-all), while 2 would mean sharing only with neighbors in a ring fashion.
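To make that guess concrete, here is a purely conceptual sketch (mine, not the benchmark's actual implementation) of what splitting gradients into N packs could look like: flatten all gradient tensors into one buffer and cut it into N roughly equal chunks, each of which can then be all-reduced separately:
import tensorflow as tf

def repack_gradients(grads, num_packs):
    # Conceptual only: flatten a list of gradient tensors into one buffer
    # and split it into num_packs chunks for separate all-reduce operations.
    flat = tf.concat([tf.reshape(g, [-1]) for g in grads], axis=0)
    pack_size = -(-int(flat.shape[0]) // num_packs)                        # ceiling division
    padded = tf.pad(flat, [[0, pack_size * num_packs - int(flat.shape[0])]])
    return tf.split(padded, num_packs)

# Example: two fake gradient tensors (9 + 5 = 14 values) split into 4 packs of 4.
packs = repack_gradients([tf.ones([3, 3]), tf.ones([5])], num_packs=4)
print([int(p.shape[0]) for p in packs])   # [4, 4, 4, 4]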
Given that Horovod uses its own all-reduce algorithm, it makes sense that setting gradient_repacking has no effect when --variable_update=horovod.

Related

How to get multiple GPUs of the same type on Slurm?

How can I create a job with multiple GPUs of the same type, without naming that type directly? My experiment requires that all GPUs be of the same type, but the type itself can be whatever we want.
Currently I am only able to create an experiment with multiple GPUs by stating exactly which type I want:
--gres=gpu:gres_type:amount
If I don't specify gres_type, I sometimes get a mixed pack of GPUs (say, 2x Titan V and 2x Titan X).
If you are fortunate enough that the cluster is consistent in the types of nodes that host the GPUs, and that the features of the nodes are properly specified and allow distinguishing between the nodes that host the different GPU types, you can use the --constraint parameter.
For the sake of the argument, let's assume that the nodes hosting the Titan V GPUs have Haswell CPUs, that those hosting the Titan X GPUs have Skylake CPUs, and that those CPU types are defined as node features. Then, you can request
--gres=gpu:2
--constraint=[haswell|skylake]
If the above does not apply to your use case, you can submit two jobs and keep only the one that starts the earliest. For that, give your jobs an identical name, and use the singleton dependency.
Write a submission script like this one
#!/bin/bash
#SBATCH --dependency=singleton
#SBATCH --job-name=gpujob
# Other options
scancel --state=PENDING --jobname=gpujob
# etc.
and submit it twice with
$ sbatch --gres=gpu:titanX:2 submit.sh
$ sbatch --gres=gpu:titanV:2 submit.sh
Each job will be assigned only one type of GPU, and the first one that starts will cancel the other one. This approach can scale up with more than two GPU types.

Explanation of parallel arguments of tf.while_loop in TensorFlow

I want to implement an algorithm which allows a parallel implementation in TensorFlow. My question is what the arguments parallel_iterations, swap_memory and maximum_iterations actually do, and what their appropriate values are depending on the situation. Specifically, the documentation on TensorFlow's site https://www.tensorflow.org/api_docs/python/tf/while_loop says that parallel_iterations is the number of iterations allowed to run in parallel. Is this the number of threads? When should someone allow CPU-GPU memory swapping, and for what reason? What are the advantages and disadvantages of this choice? What is the purpose of maximum_iterations? Can it be combined with parallel_iterations?
swap_memory is used when you want to have extra memory on the GPU device. Usually when you are training a model, some activations are saved in GPU memory for later use. With swap_memory, you can store those activations in CPU memory instead and use the GPU memory to fit, e.g., larger batch sizes, and this is an advantage. You would choose this if you need big batch sizes or have long sequences and want to avoid OOM exceptions. A disadvantage is computation time, since you need to transfer the data from CPU memory to GPU memory.
maximum_iterations works something like this:
num_iter = 0
while num_iter < 100 and some_condition:
    # do something
    num_iter += 1
So it is useful when you check a condition but also want an upper bound (one example is checking whether your model converges; if it doesn't, you still want to stop after k iterations).
As for parallel_iterations, I am not sure, but it does sound like multiple threads, yes. You can try it out and see the effect in a sample script.
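For reference, here is a minimal sketch (mine, based on the tf.while_loop signature from the linked docs) showing where all three arguments go:
import tensorflow as tf

# Sum the integers 0..49, but never run more than 100 iterations in total.
def cond(i, acc):
    return i < 50                 # keep looping while this holds

def body(i, acc):
    return i + 1, acc + i         # one loop step

i_final, acc_final = tf.while_loop(
    cond, body, loop_vars=[tf.constant(0), tf.constant(0)],
    parallel_iterations=10,   # how many loop iterations may be dispatched concurrently
    swap_memory=False,        # True lets backprop activations spill to host memory
    maximum_iterations=100)   # hard cap on iterations, applied on top of cond
print(int(acc_final))         # 1225 == sum(range(50))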

GPU cannot be identified to run module

I have 2 GPUs available on the machine, with IDs 2 and 3, and would like to use both of them to fit a model. Here is my code:
os.environ['CUDA_VISIBLE_DEVICES'] = '2, 3'
with tf.device('/gpu:2'):
    critic_model.fit(x, y, epochs=10)
with tf.device('/gpu:3'):
    history = model.fit(x, y, epochs=19)
However, when I check nvidia-smi, I find that only GPU 2 is utilized. I wonder why?
Any idea would be helpful!
Multiple problems here:
For one, TensorFlow has "its own" GPU numbering, independent of the IDs on your machine. So when you pass CUDA_VISIBLE_DEVICES=2,3, TensorFlow will see those two GPUs, but they will be '/gpu:0' and '/gpu:1' in the program. Since neither '/gpu:2' nor '/gpu:3' exists, I suspect that all ops are simply put on '/gpu:0' or the CPU.
However, the main problem is that this is not how you use with tf.device at all. You need to wrap the model creation in the context manager, i.e. all the op calls such as tf.nn.conv2d, tf.matmul etc. need to be wrapped. At the point where you call model.fit, the ops have already been created (and placed on '/gpu:0' by default), so your with tf.device statement does nothing.
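A minimal sketch of the fix described above (the layers are placeholders, not from the question): with CUDA_VISIBLE_DEVICES='2,3' the two cards become '/gpu:0' and '/gpu:1', and each model has to be built inside its device scope:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '2,3'   # set before TensorFlow touches the GPUs
import tensorflow as tf

with tf.device('/gpu:0'):                    # physical GPU 2
    # Build the model inside the scope so its ops are placed on this device.
    critic_model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
    critic_model.compile(optimizer='sgd', loss='mse')

with tf.device('/gpu:1'):                    # physical GPU 3
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
    model.compile(optimizer='sgd', loss='mse')

# critic_model.fit(x, y, epochs=10)   # runs on the first visible GPU
# model.fit(x, y, epochs=10)          # runs on the second visible GPU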

How to run TensorFlow on multiple nodes with several CPUs each

I want to run linear regression with TensorFlow on very large datasets. I have a cluster with 9 nodes and 36 CPUs each. What is the best way to distribute the computations across all the resources available?
According to this course https://www.coursera.org/learn/intro-tensorflow, the best way to use TensorFlow in a distributed setting is to use Estimators. So I wrote my code as suggested there and followed the instructions at https://www.tensorflow.org/deploy/distributed for the parallelisation. I then tried to run my script my_code.py (on a "small" dataset with 120 million data points and 2 feature columns, to test the code) on nodes 2 and 3 as follows:
python my_code.py \
--ps_hosts=node1:2222 \
--worker_hosts=node2:2222,node3:2222 \
--job_name=worker \
--task_index="i-2"
where i is the number of the node (either 2 or 3); on node 1 I do the same but with --job_name=ps and --task_index=0. However, this way it seems that only one CPU per node is used. Do I need to specify each CPU individually?
Thank you in advance.
As far as I understand, the best thing to do is to use all the CPUs on the same node together as a single worker, in order to make the most of the shared memory. So, in the example above, one would manually specify only 9 workers and make sure that each of them corresponds to one node on which all 36 CPUs are used. The commands to do this depend on the specific cluster used.
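As a rough sketch of that "one worker per node" idea (assuming the TF1-style Estimator API the question uses; the thread counts and feature_columns are placeholders), you can hint TensorFlow's thread pools to use all 36 cores of a node:
import tensorflow as tf

# Let the single worker process on this node use all 36 cores.
# These values are hints for TensorFlow's thread pools, not hard limits.
session_config = tf.ConfigProto(
    intra_op_parallelism_threads=36,   # threads inside a single op (e.g. one big matmul)
    inter_op_parallelism_threads=36)   # independent ops that may run concurrently

run_config = tf.estimator.RunConfig(session_config=session_config)
estimator = tf.estimator.LinearRegressor(
    feature_columns=feature_columns,   # assumed to be defined elsewhere
    config=run_config)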

Machine learning: why does the cost function not need to be differentiable?

I was playing around with TensorFlow, creating a customized loss function, and this question about general machine learning came to mind.
My understanding is that the optimization algorithm needs a differentiable cost function to find/approach a minimum; however, we can use functions that are not differentiable, such as the absolute value function (there is no derivative at x=0). As a more extreme example, I defined my cost function like this:
def customLossFun(x, y):
    return tf.sign(x)
and I expected an error when running the code, but it actually worked (it didn't learn anything but it didn't crash).
Am I missing something?
You're missing the fact that the gradient of the sign function is manually defined in the TensorFlow source code.
As you can see here:
def _SignGrad(op, _):
  """Returns 0."""
  x = op.inputs[0]
  return array_ops.zeros(array_ops.shape(x), dtype=x.dtype)
the gradient of tf.sign is defined to be always zero. This is, of course, the gradient wherever the derivative exists, i.e. everywhere except at zero.
The TensorFlow authors decided not to check whether the input is zero and throw an exception in that specific case.
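A quick way to see this behaviour for yourself (a small check of my own, assuming TF2 eager execution):
import tensorflow as tf

x = tf.Variable([-2.0, 0.0, 3.0])
with tf.GradientTape() as tape:
    y = tf.sign(x)
grad = tape.gradient(y, x)
print(grad.numpy())   # [0. 0. 0.] -- zero everywhere, and no error at x = 0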
In order to prevent TensorFlow from throwing an error, the only real requirement is that your cost function evaluates to a number for any value of your input variables. From a purely "will it run" perspective, it doesn't know or care about the form of the function it's trying to minimize.
In order for your cost function to provide you a meaningful result when TensorFlow uses it to train a model, it additionally needs to 1) get smaller as your model does better and 2) be bounded from below (i.e. it can't go to negative infinity). It's not generally necessary for it to be smooth (e.g. abs(x) has a kink where the sign flips). Tensorflow is always able to compute gradients at any location using automatic differentiation (https://en.wikipedia.org/wiki/Automatic_differentiation, https://www.tensorflow.org/versions/r0.12/api_docs/python/train/gradient_computation).
Of course, those gradients are of more use if you've chosen a meaningful cost function that isn't too flat.
Ideally, the cost function needs to be smooth everywhere to apply gradient-based optimization methods (SGD, Momentum, Adam, etc.). But nothing is going to crash if it isn't; you may just have issues with convergence to a local minimum.
When the function is non-differentiable at a certain point x, it's possible to get large oscillations if the neural network converges to this x. E.g., if the loss function is tf.abs(x), it's possible that the network weights are mostly positive, so that x > 0 at all times and the network never notices the kink in tf.abs. However, it's more likely that x will bounce around 0, so that the gradient is arbitrarily positive or negative. If the learning rate is not decaying, the optimization won't converge to the local minimum, but will bounce around it.
In your particular case, the gradient is zero all the time, so nothing's going to change at all.
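To make the tf.abs example above concrete, here is a small illustrative sketch (mine, not from the answer) of a scalar x trained to minimize tf.abs(x) with a constant learning rate; once |x| drops below the learning rate, the updates keep overshooting zero:
import tensorflow as tf

x = tf.Variable(0.3)
opt = tf.keras.optimizers.SGD(learning_rate=0.1)   # constant, non-decaying learning rate

for step in range(8):
    with tf.GradientTape() as tape:
        loss = tf.abs(x)
    grads = tape.gradient(loss, [x])
    opt.apply_gradients(zip(grads, [x]))
    print(step, float(x))   # x walks down towards 0, then keeps jumping back and forth across it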
If it didn't learn anything, what have you gained? Your loss function is differentiable almost everywhere, but it is flat almost everywhere, so the minimizer can't figure out the direction towards the minimum.
If you start out with a positive value, it will most likely be stuck at a random value on the positive side even though the minima on the left side are better (have a lower value).
TensorFlow can be used to do calculations in general. It provides a mechanism to automatically find the derivative of a given expression, and it can do so across different compute platforms (CPU, GPU) and distributed over multiple GPUs and servers if needed.
But what you implement in TensorFlow does not necessarily have to be a goal function to be minimized. You could use it, for example, to draw random numbers and perform Monte Carlo integration of a given function.
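For instance, a tiny sketch of that idea (my own example) using TensorFlow's random number generation:
import tensorflow as tf

# Monte Carlo estimate of the integral of f(x) = x**2 over [0, 1] (exact value: 1/3).
n = 1_000_000
x = tf.random.uniform([n], minval=0.0, maxval=1.0)
estimate = tf.reduce_mean(tf.square(x))   # average of f over uniform samples
print(float(estimate))                    # close to 0.333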