Using TensorFlow's Dataset API with multi-GPU training

Using TensorFlow's new Dataset API for multi-GPU training (reading from TFRecords) appears to be considerably slower (about 1/4 slower) than running on a single GPU (1 vs. 4 Tesla K80s).
According to nvidia-smi, GPU utilization is only around 15% on each of the 4 GPUs, while with a single GPU it is around 45%.
Does loading data from disk (TFRecords format) cause a bottleneck in the training speed? Using regular feed_dicts, where the entire dataset is loaded into memory, is also substantially faster than using the Dataset API.

It seems your network is throttled by:
IO from the disk, as mentioned in your last paragraph
If your Dataset starts by reading TFRecords, it will read from disk. Instead, you could load the records into memory (a list or dict) once and start the pipeline from a range of indices, e.g.:
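import tensorflow as tf

# Assumes the preloaded examples stack into one rectangular in-memory tensor;
# a plain Python list cannot be indexed by the tensor that map() passes in.
data = tf.constant(your_loaded_list)
dataset = (tf.data.Dataset.range(your_data_size)
           .shuffle(buffer_size=20)
           .map(lambda i: tf.gather(data, i), num_parallel_calls=8)
           .prefetch(20))  # prefetch last, so it overlaps with the training step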
Heavy pre/post-processing, as suggested by your second paragraph, where single-GPU utilization is only 45%. If that figure was measured with the data already preloaded into memory, it suggests your network spends significant time outside the "main" computation body.
First, check whether multi-threading the map call as above helps. Also consider trimming tf.summary operations, which can feed back a lot of unnecessary data and throttle your bandwidth or the subsequent writes to disk.
Hope this helps.
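One way to check whether input is the bottleneck is to time the input pipeline on its own, with no model attached. The following is a minimal sketch in plain Python, assuming the pipeline can be exposed as any iterable of batches; measure_throughput and fake_batches are hypothetical names used only for illustration, and fake_batches merely sleeps to mimic disk reads.

```python
import time
from itertools import islice


def measure_throughput(batches, num_batches):
    """Consume up to num_batches items and return (count, batches per second)."""
    start = time.perf_counter()
    consumed = sum(1 for _ in islice(batches, num_batches))
    elapsed = time.perf_counter() - start
    rate = consumed / elapsed if elapsed > 0 else float("inf")
    return consumed, rate


def fake_batches(delay_s=0.001):
    """Stand-in for a real input pipeline; the sleep mimics a disk read."""
    while True:
        time.sleep(delay_s)
        yield [0] * 32


consumed, rate = measure_throughput(fake_batches(), 50)
print(f"{consumed} batches, {rate:.0f} batches/s")
```

If the rate measured this way is close to the rate of the full training loop, the GPUs are most likely waiting on input rather than on computation.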

Related

Is it possible to do the whole training procedure in GPU with Tensorflow/Keras?

If the dataset is small enough to fit in the GPU memory, is it possible with Tensorflow to allocate it all initially and then do the training without having data transfers between CPU and GPU?
It seems to me that with tf.data this is not possible and the data transfer is not controlled by the programmer.
Analyzing the GPU workload during training, it reaches 75% with CIFAR10, but I would expect it to reach 100% given that the dataset fits in GPU memory. Also, analyzing with TensorBoard, I see that there are a lot of Send operations.
(I saw that there is a similar question quite old here, however at that time there was no tf.data yet)

Does TensorFlow automatically use multiple CPUs?

I have programmed some code doing an inference with Tensorflow's C API (CPU only). It is running on a cluster node, where I have access to 24 CPUs and 1 GPU. I do not make use of the GPU as I will need to do the task CPU-only later on.
Somehow, every time I call the TensorFlow code from the other program (OpenFOAM), TensorFlow seems to run in parallel on all CPUs. However, I have not done anything to cause this behavior. Now I would like to know whether TensorFlow does this parallelization by default.
Greets and thanks in advance!
I am not sure how you are using TensorFlow, but a typical TensorFlow training job has an input pipeline that can be thought of as an ETL process. The following are the main activities involved:
Extract: Read data from persistent storage
Transform: Use CPU cores to parse and perform preprocessing operations on the data such as image decompression, data augmentation transformations (such as random crop, flips, and color distortions), shuffling, and batching.
Load: Load the transformed data onto the accelerator device(s) (for example, GPU(s) or TPU(s)) that execute the machine learning model.
CPUs are generally used during the data transformation, where the input elements are preprocessed. Because input elements are independent of one another, this pre-processing can be parallelized across multiple CPU cores.
TensorFlow provides the tf.data API, which offers the tf.data.Dataset.map transformation. To control the level of parallelism, map takes the num_parallel_calls argument.
Read more on this from here:
https://www.tensorflow.org/guide/performance/datasets
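The idea behind a parallel map can be sketched in plain Python as an analogy. This is not tf.data itself: parallel_map and preprocess are hypothetical names, and the num_parallel_calls parameter is only borrowed from tf.data to show what the knob controls. Plain Python threads illustrate the structure, not the speed-up of a real pipeline.

```python
from concurrent.futures import ThreadPoolExecutor


def preprocess(record):
    """Hypothetical per-element transform (parsing, augmentation, ...)."""
    return record * 2


def parallel_map(fn, records, num_parallel_calls):
    """Apply fn to every record on a pool of worker threads, keeping input order."""
    with ThreadPoolExecutor(max_workers=num_parallel_calls) as pool:
        return list(pool.map(fn, records))


records = list(range(10))
transformed = parallel_map(preprocess, records, num_parallel_calls=4)
print(transformed)
```

The result is the same as a sequential map; only the number of elements processed at the same time changes.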

Deep networks on Cloud ML

I am trying to train a very deep model on Cloud ML; however, I am having serious memory issues that I have not managed to work around. The model is a very deep convolutional neural network for auto-tagging music.
The model for this can be found in the image below. A batch of 20 with a tensor of 12x38832x1 is inserted in the network.
The music was originally 465894x1 samples, which was then split into 12 windows, hence 12x38832x1. When using the map_fn function, each loop gets a separate window of 38832x1 samples (conv1d).
Processing one window at a time yields better results than processing the whole music with one CNN. The split was done prior to storing the data in TFRecords in order to minimise the processing needed during training. The data is loaded into a queue with a maximum queue size of 200 samples (i.e. 10 batches).
Once dequeued, it is transposed to have the 12 dimension first, which can then be used in the map_fn function for processing the windows. It is not transposed prior to being queued because the first dimension needs to match the batch dimension of the output, which is [20, 50], where 20 is the batch size of the data and 50 is the number of different tags.
For each window, the data is processed and the results of each map_fn are superpooled using a smaller network. The processing of the windows is done by a very deep neural network, which I am struggling to keep because every configuration I try gives me out-of-memory errors.
As a model I am using one similar to the Census TensorFlow model.
First and foremost, I am not sure if this is the best option, since for evaluation a separate graph is built and variables are not shared. This would require double the amount of parameters.
Secondly, as a cluster setup, I have been using one complex_l master, 3 complex_l workers and 3 large_model parameter servers. I do not know if I am underestimating the amount of memory needed here.
My model has previously worked with a much smaller network. However, increasing its size started giving me out-of-memory errors.
My questions are:
The memory requirement is big, but I am sure it can be processed on Cloud ML. Am I underestimating the amount of memory needed? What are your suggestions about the cluster for such a network?
When using a tf.train.Server in the dispatch function, do you need to pass on the cluster_spec so that it is used in the replica_device_setter, or does it allocate on its own? When not using it, and enabling device placement logging in tf.ConfigProto, all the variables seem to be on the master worker. In the Census example, this is not passed on in task.py. Can I assume this is correct?
How does one calculate how much memory is needed for a model (a rough estimate to select the cluster)?
Is there any other TensorFlow core tutorial on how to set up such big jobs (other than Census)?
When training a big model with distributed between-graph replication, does the whole model need to fit on the worker, or does the worker only run ops and then transmit the results to the PS? Does that mean that the workers can have low memory, just enough for single ops?
PS: With smaller models the network trained successfully. I am trying to deepen the network for better ROC.
Questions coming up from on-going troubleshooting:
When using the replica_device_setter with the cluster parameter, I noticed that the master has very little memory and CPU usage, and the placement log shows very few ops on the master. I checked the TF_CONFIG that is loaded, and it says the following for the cluster field:
u'cluster': {u'ps': [u'ps-4da746af4e-0:2222'], u'worker': [u'worker-4da746af4e-0:2222'], u'master': [u'master-4da746af4e-0:2222']}
On the other hand, the tf.train.ClusterSpec documentation only shows workers. Does that mean that the master is not considered a worker? What happens in such a case?
Is the error due to memory or something else? An EOF error?
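For the rough-estimate question above, a common back-of-the-envelope approach is to count bytes for the input batch and for the trainable parameters. The sketch below is only a rule of thumb under stated assumptions: float32 values (4 bytes each) and an optimizer that keeps two extra slots per parameter (as Adam does). The function names are hypothetical, and activation memory, which depends on every layer's output shape and often dominates in deep convolutional networks, is not included.

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Bytes needed for a dense tensor of the given shape (float32 by default)."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


def training_param_bytes(num_params, optimizer_slots=2, bytes_per_element=4):
    """Weights + gradients + optimizer slots (2 slots assumed, as for Adam)."""
    return num_params * bytes_per_element * (2 + optimizer_slots)


# One input batch from the question: 20 examples of shape 12x38832x1.
input_batch = tensor_bytes((20, 12, 38832, 1))
print(f"input batch: {input_batch / 2**20:.1f} MiB")

# Hypothetical parameter count, for illustration only.
params = training_param_bytes(1_000_000)
print(f"parameters during training: {params / 2**20:.1f} MiB")
```

Adding the output size of each layer, multiplied by the batch size, to these figures gives a first estimate to compare against the memory of each machine type.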

Debugging batching in Tensorflow Serving (no effect observed)

I have a small web server that gets input in terms of sentences and needs to return a model prediction using Tensorflow Serving. It's working all fine and well using our single GPU, but now I'd like to enable batching such that Tensorflow Serving waits a bit to group incoming sentences before processing them together in one batch on the GPU.
I'm using the predesigned server framework with the predesigned batching framework using the initial release of Tensorflow Serving. I'm enabling batching using the --batching flag and have set batch_timeout_micros = 10000 and max_batch_size = 1000. The logging does confirm that batching is enabled and that the GPU is being used.
However, when sending requests to the serving server the batching has minimal effect. Sending 50 requests at the same time almost linearly scales in terms of time usage with sending 5 requests. Interestingly, the predict() function of the server is run once for each request (see here), which suggests to me that the batching is not being handled properly.
Am I missing something? How do I check what's wrong with the batching?
Note that this is different from How to do batching in Tensorflow Serving? as that question only examines how to send multiple requests from a single client, but not how to enable Tensorflow Serving's behind-the-scenes batching for multiple separate requests.
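The intended effect of the two settings named above can be sketched with a plain Python queue. This is not TensorFlow Serving's actual implementation, only an illustration of the semantics: wait for a first request, then keep collecting until either max_batch_size is reached or the timeout expires. collect_batch and its parameter names are hypothetical.

```python
import queue
import time


def collect_batch(requests, max_batch_size, batch_timeout_s):
    """Block for the first request, then gather more until the batch is
    full or the timeout (measured from the first request) expires."""
    batch = [requests.get()]
    deadline = time.monotonic() + batch_timeout_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break
    return batch


pending = queue.Queue()
for sentence_id in range(5):
    pending.put(sentence_id)

# 10000 microseconds, as in the question, expressed in seconds.
batch = collect_batch(pending, max_batch_size=1000, batch_timeout_s=0.01)
print(len(batch))
```

If batching works as intended, requests that arrive within the timeout window end up in one call to the model, so one model invocation per request is a sign that requests are not being grouped.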
(I am not familiar with the server framework, but I'm quite familiar with HPC and with cuBLAS and cuDNN, the libraries TF uses to do its dot products and convolutions on GPU)
There are several issues that could cause disappointing performance scaling with the batch size.
I/O overhead, by which I mean network transfers, disk access (for large data), serialization, deserialization and similar cruft. These things tend to be linear in the size of the data.
To look into this overhead, I suggest you deploy 2 models: the one that you actually need, and a trivial one that uses the same I/O. Then subtract the time needed by the trivial model from the time needed by the real one.
This time difference should be similar to the time running the complex model takes, when you use it directly, without the I/O overhead.
If the bottleneck is in the I/O, speeding up the GPU work is inconsequential.
Note that even if increasing the batch size makes the GPU faster, it might make the whole thing slower, because the GPU now has to wait for the I/O of the whole batch to finish to even start working.
cuDNN scaling: Things like matmul need large batch sizes to achieve their optimal throughput, but convolutions using cuDNN might not (at least they haven't in my experience, though this might depend on the version and the GPU architecture).
RAM, GPU RAM, or PCIe bandwidth-limited models: If your model's bottleneck is in any of these, it probably won't benefit from bigger batch sizes.
The way to check this is to run your model directly (perhaps with mock input), compare the timing to the aforementioned time difference and plot it as a function of the batch size.
By the way, as per the performance guide, one thing you could try is using the NCHW layout, if you are not already. There are other tips there.
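The subtraction described above can be sketched as a small timing harness. Both models here are stand-ins (plain Python functions) and all names are hypothetical. In a real test, each call would send a request to the deployed trivial model and to the deployed real model respectively.

```python
import time


def mean_latency(fn, payload, repeats=5):
    """Average wall-clock seconds for fn(payload) over several runs."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(payload)
    return (time.perf_counter() - start) / repeats


def trivial_model(batch):
    """Stand-in for the trivial model: same I/O path, no real compute."""
    return [0 for _ in batch]


def real_model(batch):
    """Stand-in for the real model: same I/O path plus some compute."""
    return [sum(i * i for i in range(2000)) for _ in batch]


for batch_size in (1, 8, 64):
    batch = list(range(batch_size))
    io_only = mean_latency(trivial_model, batch)
    total = mean_latency(real_model, batch)
    print(batch_size, f"estimated compute: {total - io_only:.6f} s")
```

Plotting the estimated compute time against the batch size, next to the timing of the model run directly, shows whether the batch size or the I/O path is what limits throughput.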

Minimizing GPU's idle time when using TensorFlow

How can we minimize the idle time of a GPU when training a network using TensorFlow?
To do this:
I used multiple Python threads to preprocess data and feed it to a tf.RandomShuffleQueue, from which TensorFlow took the data.
I thought that this would be more efficient than the feed_dict method.
However, nvidia-smi shows that my GPU still goes from 100% utilization to 0% utilization and back to 100% quite often.
Since my network is large and the dataset is also large (12 million), any fruitful advice on speeding up would be very helpful.
Is my thinking that reading data directly from a tf.Queue is better than feed_dict correct?
NOTE: I am using a 12 GB Titan X GPU (Maxwell architecture)
You are correct in assuming that feeding through a queue is better than feed_dict, for multiple reasons (mainly that loading and preprocessing are done on the CPU, and not on the main thread). But one thing that can undermine this is the GPU consuming the data faster than it is loaded. You should therefore monitor the size of your queue to check whether there are times when the queue size is 0.
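What this monitoring looks for can be illustrated with Python's standard queue module as a stand-in for the TensorFlow queue. The names are hypothetical and the sleep only simulates loading cost. With a TensorFlow 1.x queue, the equivalent reading would come from the queue's size() op.

```python
import queue
import threading
import time


def producer(q, num_items, delay_s):
    """Stand-in for the preprocessing threads that feed the queue."""
    for item in range(num_items):
        time.sleep(delay_s)  # simulated load + preprocess cost
        q.put(item)


def consume_and_count_starved(q, num_items):
    """Consume num_items and count how often the queue was empty on arrival."""
    starved = 0
    for _ in range(num_items):
        if q.empty():
            starved += 1  # the training step would have to wait here
        q.get()
    return starved


q = queue.Queue(maxsize=100)
worker = threading.Thread(target=producer, args=(q, 20, 0.002), daemon=True)
worker.start()
starved = consume_and_count_starved(q, 20)
worker.join()
print(f"queue was empty on {starved} of 20 steps")
```

A count that stays high during training means the consumer is outrunning the producers, which matches the utilization swinging between 100% and 0%.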
If this is the case, I would recommend moving your threading process into the graph. TensorFlow has some nice mechanisms that allow batch loading in threads on the CPU very efficiently (your loading batches should be larger than your training batches to maximise your loading efficiency; I personally use training batches of 128 and loading batches of 1024). Moreover, you should place your queue on the CPU and give it a large maximum size, so that you can take advantage of the large amount of RAM (I always have more than 16000 images loaded in RAM, waiting for training).
If you still have trouble, check TensorFlow's performance guide:
https://www.tensorflow.org/guide/data_performance
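The idea of loading batches that are larger than the training batches can be sketched as a generator that makes one large read and slices it into several training batches. This is a plain Python illustration, not the in-graph mechanism; load_chunk is a hypothetical loader that returns example ids instead of real data.

```python
def load_chunk(chunk_index, load_batch_size=1024):
    """Hypothetical loader: one large read, returning example ids as placeholders."""
    first = chunk_index * load_batch_size
    return list(range(first, first + load_batch_size))


def training_batches(loader, num_chunks, train_batch_size):
    """Load large chunks, then slice each one into smaller training batches."""
    for chunk_index in range(num_chunks):
        chunk = loader(chunk_index)
        for start in range(0, len(chunk), train_batch_size):
            yield chunk[start:start + train_batch_size]


batches = list(training_batches(load_chunk, num_chunks=2, train_batch_size=128))
print(len(batches), len(batches[0]))
```

Each load of 1024 examples yields 8 training batches of 128, so the cost of one read is spread over 8 training steps.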