TensorFlow Slim - Clone on CPU

What does 'Use CPUs to deploy clones' mean in the following snippet (slim/train_image_classifier.py):
tf.app.flags.DEFINE_boolean(
    'clone_on_cpu', False,
    'Use CPUs to deploy clones.')

In the usual setup, model losses and gradients are computed on GPUs, and a single clone uses a single GPU. For multi-GPU training, multiple clones are created: if you have 4 GPUs, 4 clones are created and the losses for separate batches are computed simultaneously (data parallelism). Now, if you don't have GPUs, you can use multiple CPUs for data parallelism instead (it will of course be slower than on GPUs). The 'Use CPUs to deploy clones' option lets you use CPUs for data parallelism, i.e. compute the model losses and gradients on CPUs.
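For context, here is a minimal sketch of how such a flag is typically consumed via the model_deploy module that ships next to train_image_classifier.py (names and details may differ between versions of the slim code):

import tensorflow as tf
from deployment import model_deploy  # module bundled with the slim scripts

# clone_on_cpu=True places every clone (loss + gradient computation) on the CPU
# instead of a dedicated GPU; num_clones controls the degree of data parallelism.
deploy_config = model_deploy.DeploymentConfig(
    num_clones=2,
    clone_on_cpu=True)

# Variables still live on the device chosen by the config.
with tf.device(deploy_config.variables_device()):
    global_step = tf.train.get_or_create_global_step()

Equivalently, when launching the script: python train_image_classifier.py --clone_on_cpu=True --num_clones=2 ...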

Related

How to merge two or more trained weights?

I implemented a 5x5 Gomoku agent using a CNN + DQN.
Here is the github link:
https://github.com/bokidigital/CNN_DQN_5x5_Gomoku
My problem is that this code has no parallelization design.
This means that when running it on an Intel Skylake server (2 CPUs, 80 cores), the CPU usage is only about 90%.
I think the ideal CPU usage would be 8000% (80 cores).
Besides the neural network part (which uses the GPU at about 75%), I have some customized game rules that run on the CPU with no parallelization.
My environment is:
Skylake CPU X 2
NVIDIA P100 X 2 ( only use 1 )
40GB RAM
Tensorflow 1.14.0
Keras
Python 3.7
Ubuntu 16.04
My idea is to run many copies of this process in different folders, each generating its own weights, so that CPU usage could ideally reach 8000% (as long as enough processes run at the same time).
Since this is the training process, it doesn't matter how each process trains its weights.
Q1. How do I merge their results (the trained weights)? (A+B)/2? (See the weight-averaging sketch below.)
Q2. It seems one GPU can only be used by one process; when I tried to run 3 processes at the same time, the GPU seemed to hang.
Q3. If I disable the GPU, will 80 Skylake cores be faster than an NVIDIA P100?
I want to use more of the CPU to speed up this training process.
The 5x5 agent took 5 days to train; when I tested the same code with the grid size changed to 9x9, I estimated the training would take about 3 months.
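As a hedged illustration of Q1: for two Keras models with identical architectures, a simple (A+B)/2 merge can be written as the sketch below. The file paths are hypothetical, and note that naive weight averaging of independently trained networks often does not produce a sensible policy.

from tensorflow import keras

def average_weights(model_a, model_b):
    # Element-wise average of two models with identical architectures.
    return [(wa + wb) / 2.0
            for wa, wb in zip(model_a.get_weights(), model_b.get_weights())]

# Hypothetical paths to models produced by two independent training runs.
model_a = keras.models.load_model('run_a/gomoku.h5')
model_b = keras.models.load_model('run_b/gomoku.h5')
model_a.set_weights(average_weights(model_a, model_b))
model_a.save('merged/gomoku.h5')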

Tuning Tensorflow Estimators based on batch size, hash bucket size, memory etc on CPU?

We're testing out various estimators such as LinearEstimator, DNNClassifier etc. Right now we are restricted to using only CPUs for training, and we're testing out parameters and levers such as:
CPU: 8~32 cpu's
Memory: 16~48 GB
Batch/buffer size (dataset.batch(n)): n = 128~512
Hash bucket_size: 10,000 ~ 500,000
Number of threads: Tensorflow default, which should be number of logical cores
Optimizer: GradientDescent, FtrlOptimizer
Result: global steps per second * batch_size of around 20~50
So via TensorBoard we're seeing global steps per second * batch_size of around 20~50, and increasing CPU and memory only helps up to a point.
We see similar results regardless of the optimizer and its configuration.
Are we doing something wrong, and are there other levers we can use? Is there a limit to how much you can optimize model training this way, and should we move on to GPUs and take advantage of their matrix multiplication efficiency?
You can try optimizing your input pipeline with the Dataset API. Consider converting your data to TFRecords; it can give substantial improvements. If you have multiple CPU machines you can set up a cluster. But it all depends heavily on what data you have. Also take a look at:
https://www.tensorflow.org/guide/performance/datasets
https://www.tensorflow.org/guide/performance/overview
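As a rough sketch of the TFRecord + Dataset approach for an Estimator input_fn (the feature names, shapes and file names are placeholders; the parse spec has to match how the records were actually written):

import tensorflow as tf  # written against the TF 1.x API

def make_input_fn(filenames, batch_size=256):
    # Hypothetical feature spec; adjust to your own TFRecord schema.
    feature_spec = {
        'features': tf.FixedLenFeature([10], tf.float32),
        'label': tf.FixedLenFeature([], tf.int64),
    }

    def parse(serialized):
        example = tf.parse_single_example(serialized, feature_spec)
        label = example.pop('label')
        return example, label

    def input_fn():
        dataset = tf.data.TFRecordDataset(filenames)
        dataset = dataset.map(parse, num_parallel_calls=8)  # parallel parsing
        dataset = dataset.batch(batch_size)
        dataset = dataset.prefetch(1)  # overlap input preparation with training
        return dataset

    return input_fn

# estimator.train(input_fn=make_input_fn(['train-00000.tfrecord', 'train-00001.tfrecord']))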

Is there a way to run each segment of a batch on different GPU?

So I am very new to Tensorflow and GPUs and I was wondering if I can feed different segments of my batch to different GPUs and aggregate the result at the end. What I mean is that, let's say that batch size in each epoch of my training is 1600 and I have 4 GPUs. Can I feed batches of size 400 to each GPU during each epoch of training and then aggregate the result?
You can do that; you will have to perform multi-GPU training.
In TensorFlow you can use a tower-based design, where you collect and aggregate the gradients from each tower before backpropagation, but it is neither simple nor particularly efficient.
You should use Horovod, which is easy and efficient.
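A minimal sketch of what Horovod data-parallel training can look like with Keras (assuming Horovod is installed; the toy model and the per-worker data shard are placeholders, and each of the 4 processes would be launched with something like horovodrun -np 4 python train.py):

import numpy as np
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # one process per GPU

# Pin this process to its own GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
tf.keras.backend.set_session(tf.Session(config=config))

# Placeholder per-worker shard: 400 samples of 20 features each.
x_shard = np.random.rand(400, 20).astype('float32')
y_shard = np.random.randint(10, size=400)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(20,)),
    tf.keras.layers.Dense(10, activation='softmax'),
])

# Wrap the optimizer so gradients are averaged across all workers.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(optimizer=opt, loss='sparse_categorical_crossentropy')

# With 4 workers and 400 samples per step each, the effective global batch is 1600.
model.fit(x_shard, y_shard, batch_size=400, epochs=5,
          callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)])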

How to fix low volatile GPU-Util with Tensorflow-GPU and Keras?

I have a 4 GPU machine on which I run Tensorflow (GPU) with Keras. Some of my classification problems take several hours to complete.
nvidia-smi returns Volatile GPU-Util which never exceeds 25% on any of my 4 GPUs.
How can I increase GPU Util% and speed up my training?
If your GPU utilization is below 80%, this is generally the sign of an input pipeline bottleneck. It means that the GPU sits idle much of the time, waiting for the CPU to prepare the data.
What you want is for the CPU to keep preparing batches while the GPU is training, so that the GPU stays fed. This is called prefetching.
But if batch preparation still takes much longer than the model's training step, the GPU will remain idle, waiting for the CPU to finish the next batch. To make batch preparation faster, we can parallelize the different preprocessing operations.
We can go even further by parallelizing I/O.
To implement this in Keras, you need to use the TensorFlow Data API with TensorFlow version >= 1.9.0. Here is an example:
Let's assume, for the sake of this example, that you have two numpy arrays x and y. You can use tf.data for any type of data, but this is simpler to understand.
import tensorflow as tf

def preprocessing(x, y):
    # Can only contain TF operations
    ...
    return x, y

dataset = tf.data.Dataset.from_tensor_slices((x, y))  # Creates a dataset object
dataset = dataset.map(preprocessing, num_parallel_calls=64)  # Parallel preprocessing
dataset = dataset.batch(batch_size)
dataset = dataset.prefetch(1)  # Prepares the next batch while the current one is consumed
...
model = tf.keras.Model(...)
model.fit(x=dataset)  # Since tf 1.9.0 you can pass a dataset object
tf.data is very flexible, but like anything in TensorFlow (except eager execution), it uses a static graph. This can be a pain sometimes, but the speed-up is worth it.
To go further, you can have a look at the performance guide and the Tensorflow data guide.
I had a similar issue - the memory of all the GPUs was allocated by Keras, but volatile GPU utilization was around 0% and training was taking almost the same amount of time as on the CPU. I was using ImageDataGenerator, which turned out to be the bottleneck. When I increased the number of workers in the fit_generator method from the default value of 1 to all available CPU cores, the training time dropped rapidly.
You can also load the data into memory and then use the flow method to prepare batches with augmented images.
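A minimal sketch of that workers change, assuming an ImageDataGenerator pipeline (the directory layout, image size and toy model are placeholders):

import multiprocessing
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Tiny placeholder model just to make the example complete.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(224, 224, 3)),
    tf.keras.layers.Dense(2, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy')

datagen = ImageDataGenerator(rescale=1. / 255, horizontal_flip=True)
# Hypothetical directory layout: data/train/<class_name>/*.jpg
train_gen = datagen.flow_from_directory('data/train',
                                        target_size=(224, 224),
                                        batch_size=32)

model.fit_generator(
    train_gen,
    steps_per_epoch=len(train_gen),
    epochs=10,
    workers=multiprocessing.cpu_count(),  # default is 1, which was the bottleneck
    use_multiprocessing=True)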

Tensorflow batching is very slow

I tried to set up a very simple MNIST example with an Estimator.
First I used the estimator's deprecated fit() parameters x, y and batch_size. This executed very fast and utilized about 100% of my GPU while not affecting the CPU much (about 10% utilization). So it worked as expected.
Because the x, y and batch_size parameters are deprecated, I wanted to use the input_fn parameter for the fit() function. To build the input_fn, I used a tf.slice_input_producer and batched it with tf.train.batch. This is my code https://gist.github.com/andreas-eberle/11f650fca0dce4c9d3d6c0955145e80d. You should be able to just run it with tensorflow 1.0.
My problem is that the training now runs very slowly and only utilizes about 30% of my GPU (shown in nvidia-smi).
I also tried to increase the queue capacity of the slice_input_producer and the number of threads used for batching. However, this only got me to about 45% GPU utilization, and it resulted in 100% CPU utilization.
What am I doing wrong? Is there a better way of feeding and batching the inputs? I do not want to create the batches manually (creating subarrays of the numpy input array), because I want to use this example for a more complex input queue where I'll be reading and preprocessing the images in the graph.
I don't think my hardware should be the problem:
Windows 10
NVidia GTX 960M
i7-6700HQ
32 GB RAM
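For reference, an input_fn built with slice_input_producer and tf.train.batch typically looks roughly like the sketch below (this is not the gist's exact code; the placeholder arrays stand in for the MNIST data, and the capacities/thread counts are illustrative):

import numpy as np
import tensorflow as tf  # written against the TF 1.x queue-runner API

def input_fn():
    # Placeholder data standing in for the MNIST arrays.
    images = np.random.rand(1000, 28, 28, 1).astype('float32')
    labels = np.random.randint(10, size=1000).astype('int64')

    # Produces one (image, label) pair at a time from the in-memory arrays.
    image, label = tf.train.slice_input_producer(
        [tf.constant(images), tf.constant(labels)],
        shuffle=True, capacity=2048)

    # Assembles pairs into batches; more threads help keep the GPU fed.
    image_batch, label_batch = tf.train.batch(
        [image, label], batch_size=128, num_threads=8, capacity=4096)

    return {'images': image_batch}, label_batch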