Improving cross-validation throughput in TensorFlow, Keras - tensorflow

I am working on CNN models which are intended to predict a protein's structure from its amino acid sequence. I am implementing my CNNs in Keras. The Keras API is the one that comes bundled with TensorFlow 1.4.0, so TensorFlow is my backend. I have installed the GPU version of TensorFlow, and I have verified that the GPU is being used. My GPU is somewhat older, an NVidia GTX 760.
When I perform 3X cross-validation to help select architectures and hyperparameters, I have 50K examples in my training folds and 25K samples in my validation folds. These are decently large data sets, however they're small in comparison to the RAM available in my computer (16 GB) or on my GPU (2 GB). Fully unpacked and expressed as float32 values, with redundancy introduced because of sliding windows, all the folds taken together, input plus target values, occupies 316 MB. I have pre-calculated my folds, and saved files of each fold to disk. When I experiment with architectures and hyperparameters, the same folds are being used in every trial.
I started with networks containing a single hidden layer to see what I could achieve, and then switched to two hidden layers. I used a fixed batch size of 64 for all of my early experiments. Training proceeded quickly enough that I didn't concern myself with speed. Performing a 3X cross-validation for a given architecture typically took about 12 minutes.
But in the last experiment that I did with two-layer networks, I decided to start investigating the effect of batch size. I learned that smaller batch sizes gave me better results, up to a point. Batch sizes of 8 were the smallest ones that I could count on not to crash. My loss values will occasionally flip to NaN with batch sizes of 4, and they will frequently flip to NaN with batch sizes of 1 or 2. After that occurs, the network becomes untrainable. I am aware of the possibility of gradient instability. I think I was getting some.
So why not just use batch sizes of 8 and keep going? The problem is speed. Using two hidden layers, batches of eight took me approximately 35 minutes to cross-validate. Batches of 64, as I mentioned above, took one third that much time. My first experiments with three hidden layers have taken 45 to 65 minutes per trial. And I want to investigate potentially hundreds of architectures and hyperparameters, using still deeper networks. With small batches, I can see that the batch-by-batch progress bar in Keras progresses more slowly. I can see much longer pauses when an epoch ends.
Yes, I can upgrade my GPU to a 10 series. I think that will only double my throughput at most? Yes, I can rent GPU time in the cloud. Eventually I might do that. But if my software is inefficient, I definitely don't want to set it loose in the cloud to burn my money.
It is my understanding (please correct me if I am wrong) that when the GPU is used in a normal TF / Keras workflow, each individual batch is sent separately from the CPU to the GPU. If I am training 50 networks in a 3X cross-validation scheme, this would mean that I'm sending the same data to my GPU 150 times. As I mentioned earlier, all my data occupies at most 316 MB, about 15% of the RAM available on the GPU. Can I devise a workflow which sends this 316 MB to the GPU once, and if so, will that have a useful impact on my throughput? Intuitively, it feels like it should.
Are there other bottlenecks I should be thinking about? Is there a way to profile TF or Keras operations?
Thanks for any advice you may have!

Okay. I know that you're more concerned about throughput from Keras and your hardware, but there are a few things I'd like to mention here:
smaller batch sizes gave me better results
Given your case, where the data set is not very large, and assuming you're training for a fixed number of epochs (say 5), a smaller batch size is naturally expected to give slightly better results, because it means more back-prop steps overall than a larger batch size. If you're training for a fixed number of training steps instead, I don't know why this is happening.
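To make the "more back-prop steps" point concrete, here is a small arithmetic sketch. It uses the 50K training examples from the question and the 5 epochs mentioned above as an example figure; the function name is made up for illustration.

```python
import math

def gradient_updates(num_examples, batch_size, epochs):
    """Number of weight updates when training for a fixed number of epochs."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs

# 50,000 training examples (from the question), 5 epochs (example figure)
for batch_size in (8, 64):
    print(batch_size, gradient_updates(50_000, batch_size, epochs=5))
# 8 31250
# 64 3910
```

So at the same number of epochs, batches of 8 get roughly eight times as many weight updates as batches of 64.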
loss values will occasionally flip to NaN with batch sizes of 4
Again, I'm assuming you're using batch normalization here, with CNNs. With BN, it's generally not recommended to use a small batch size like 2 or 4 (or even 8). One probable reason for the NaNs with small batches is low variance within the current batch: if the epsilon value is also too small, the normalization divides by a very small number, which can lead to numerical instability going forward. But more generally, this might be a case of gradient instability like you mentioned. Consider using gradient clipping to see if it helps.
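For reference, Keras optimizers accept clipnorm and clipvalue arguments for gradient clipping. The sketch below is not Keras code; it is a plain-Python illustration of what clipping by norm does to a single gradient vector, with a made-up function name.

```python
import math

def clip_by_norm(gradient, max_norm):
    """Rescale a gradient vector so its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= max_norm or norm == 0.0:
        return list(gradient)
    scale = max_norm / norm
    return [g * scale for g in gradient]

exploding = [300.0, -400.0]  # L2 norm is 500
clipped = clip_by_norm(exploding, max_norm=1.0)
print(clipped)  # same direction, norm reduced to about 1.0
```

The direction of the update is kept; only its length is capped, which is why clipping can stop a single bad batch from pushing the weights to NaN.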
GPU Workflow
Here, I assume that you have only 1 GPU. Unfortunately, you can't parallelise using a single GPU. To clarify, you shouldn't be concerned about the size of your data relative to GPU RAM. In most single-GPU cases, the current batch stays on the CPU and the GPU only takes on the operations. Rather, you should be concerned about the size of the parameters that the GPU would be computing. Since the operations differ a lot between your 1-layer and 3-layer experiments, I don't think it's possible to run them concurrently, as you can't place multiple ops on the same device simultaneously. The best case for you here would be to use a larger batch size (not too large - as this would reduce the number of back-prop steps when training for a fixed number of epochs), so that you'd cover more data in a single go.
Just a tip for hyper-parameter tuning: you can consider using Highway-CNNs. These are inspired by the gating mechanism of LSTMs: you specify a large number of hidden layers and the network figures out by itself how to control the information flow among the layers. In short, this would practically eliminate the effort of tuning the depth of the network, allowing you to tune other hyper-params like learning rate or filter sizes.
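One rough way to read the timings in the question (about 35 minutes at batch size 8, about 12 minutes at batch size 64) is to assume that total time splits into a fixed cost per batch plus a cost per sample. That split is an assumption made here for illustration, not something measured in the original post, and the function name is made up.

```python
def fit_overhead_model(t_small, b_small, t_large, b_large):
    """Solve t(B) = fixed / B + per_sample for the two unknowns.

    fixed / B  : total time spent on per-batch overhead at batch size B
    per_sample : time that depends only on the amount of data
    """
    fixed = (t_small - t_large) / (1.0 / b_small - 1.0 / b_large)
    per_sample = t_large - fixed / b_large
    return fixed, per_sample

# Timings in minutes, taken from the question
fixed, per_sample = fit_overhead_model(35.0, 8, 12.0, 64)
overhead_at_8 = fixed / 8
print(round(overhead_at_8, 1), round(per_sample, 1))  # 26.3 8.7
```

Under this simple model, about 26 of the 35 minutes at batch size 8 would be per-batch overhead rather than arithmetic, which is consistent with the advice to use a larger batch size. Two data points cannot validate the model, so treat the numbers as a rough indication only.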
I hope at least some of this is relevant and helpful to you ;)

Related

Why does training time not decrease when training a Keras model after increasing the batch size beyond a certain amount

I am currently training an NLP model in Keras with TF 2.8, where I am experimenting with adding GRU and LSTM layers. When training the model, I used different batch sizes to see the impact on accuracy and overall training time.
What I noticed was that once the batch size is increased beyond a certain point, the training time no longer decreases; it stays the same.
I started with a batch size of 2 and slowly increased it up to 4096, trying multiples of two, yet after 512 the training time remained the same.
It's often wrongly mentioned that batch learning is as fast or faster than on-line training. In fact, batch learning changes the weights only once the complete set of data (the batch) has been presented to the network. Therefore, the weight update frequency is rather low. This explains why the processing speed in your measurements behaves as you observed.
Even if it's a matrix operation, each row-column multiplication might be happening on one GPU core, so the full matrix multiplication is divided across as many cores as possible. For one matrix multiplication, each GPU core takes some time, and when you add more images that time increases, due to the additional rows. If at a batch size of 4 your GPU is already at full capacity, i.e. all cores are running, then increasing the batch size is not going to give any advantage. Your added data just sits in GPU memory and is processed when a core is freed from its previous operation.
To get a further understanding of the training techniques, have a look at the 2003 paper The general inefficiency of batch training for gradient descent learning. It deals with the comparison of batch and on-line learning.
Also generally, RNN kernels can have O(timesteps) complexity, with batch size having a smaller effect than you might anticipate.
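If you want to find where the plateau starts on your own hardware, a small timing harness is enough. This is a minimal sketch: run_epoch is a hypothetical callable that trains one epoch at the given batch size, and the 5% tolerance is an arbitrary choice.

```python
import time

def time_batch_sizes(run_epoch, batch_sizes):
    """Return {batch_size: seconds} for one call of run_epoch(batch_size) each."""
    timings = {}
    for batch_size in batch_sizes:
        start = time.perf_counter()
        run_epoch(batch_size)
        timings[batch_size] = time.perf_counter() - start
    return timings

def plateau_start(timings, tolerance=0.05):
    """Smallest batch size whose next larger size is less than `tolerance` faster."""
    sizes = sorted(timings)
    for smaller, larger in zip(sizes, sizes[1:]):
        if timings[smaller] - timings[larger] < tolerance * timings[smaller]:
            return smaller
    return None
```

With Keras, run_epoch could be something like lambda b: model.fit(x, y, batch_size=b, epochs=1, verbose=0), where model, x and y are your own objects. Run one warm-up epoch first so that one-time setup costs do not distort the first measurement.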

Prediction with GPU is much slower than with CPU?

Curiously, I just found out that my CPU is much faster for predictions.
Doing inference with the GPU is much slower than with the CPU.
I have tf.keras (tf2) NN model with a simple dense layer:
import numpy as np
import tensorflow as tf

inputs = tf.keras.layers.Input(shape=(100,), dtype='float32')
X = tf.keras.layers.Dense(2)(inputs)
model = tf.keras.Model(inputs, X)
# also initialized with weights from a file
weights = np.load("weights.npy", allow_pickle=True)
model.layers[-1].set_weights(weights)
# data: array of shape (n_samples, 100), defined elsewhere
scores = model.predict_on_batch(data)
For 100 samples doing predictions I get:
2 s for GPU
0.07 s for CPU (!)
I am using a simple GeForce MX150 with 2 GB.
I also tried predict_on_batch(x), as someone suggested that it is faster than plain predict, but here it takes the same time.
Refer: Why does keras model predict slower after compile?
Does anyone have an idea what is going on here? What could the issue be?
Using the GPU adds a lot of overhead to load data into GPU memory (through the relatively slow PCI bus) and to get the results back.
In order for the GPU to be more efficient than the CPU, the model must be very big, have plenty of data and use algorithms that can run fully inside the GPU, without requiring partial results to be moved back to the CPU.
The optimal configuration depends on the quantity of memory and of cores inside your GPU, so you must do some tests, but the following rules apply:
Your NN must have at least 10k parameters, and the training data set at least 10k records. Otherwise the overhead will probably kill the performance of the GPU.
When you call model.fit, use a large batch_size (note that the default is only 32), possibly large enough to contain your whole dataset, or at least a multiple of 1024. Do some tests to find the optimum for you.
For some GPUs, it might help to perform computations in float16 instead of float32. Follow this tutorial to see how to activate it.
If your GPU has Tensor Cores, several sizes must be multiples of 8 for the hardware to be used efficiently. In the preceding tutorial, see the paragraph "Ensuring GPU Tensor Cores are used" for which parameters must be changed and how. In general, it's a bad idea to use layers whose number of neurons is not a multiple of 8.
Some types of layers, namely RNNs, have an architecture which cannot be solved directly by the GPU. In this case, data must be moved constantly back and forth to the CPU and the speed is lost. If an RNN is really needed, TensorFlow v2 has an implementation of the LSTM layer which is optimized for GPU, but there are some limitations on the parameters: see this thread and the documentation.
If you are training a reinforcement learning model, activate Experience Replay and use a memory buffer for the experience which is at least 10x your batch_size. This way, you will activate the NN training only when a big bunch of data is ready.
Deactivate as much verbosity as possible
If everything is set up correctly, you should be able to train your model faster with GPU than with CPU.
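For the "multiples of 8" advice in the list above, a tiny helper can round a planned layer width up to the next multiple. This is just a sketch; the function name is made up.

```python
def round_up_to_multiple(n, multiple=8):
    """Smallest multiple of `multiple` that is greater than or equal to n."""
    return ((n + multiple - 1) // multiple) * multiple

for units in (100, 250, 512):
    print(units, "->", round_up_to_multiple(units))
# 100 -> 104
# 250 -> 256
# 512 -> 512
```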
A GPU only pays off for compute-intensive tasks (large models), because of the overhead of copying your data and results between the host and the GPU. In your case, the model is very small, which means it takes longer to copy the data than to predict. Even if the CPU is slower than the GPU, it doesn't have to copy the data, so it's ultimately faster.
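To put "very small" in numbers: a dense layer has one weight per input-unit pair plus one bias per unit. The sketch below counts the parameters of the model in the question (100 inputs, Dense(2)); the function name is made up.

```python
def dense_param_count(n_inputs, n_units, use_bias=True):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + (n_units if use_bias else 0)

params = dense_param_count(100, 2)
print(params)  # 202
```

That is about 50 times below the 10k-parameter rule of thumb given in the other answer, so the transfer overhead dominates.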

How to select batch size automatically to fit GPU?

I am training deep neural networks with a GPU. If I make samples too large, batches too large, or networks too deep, I get an out of memory error. In this case, it is sometimes possible to make smaller batches and still train.
Is it possible to calculate the GPU memory required for training and determine what batch size to choose beforehand?
UPDATE
If I print network summary, it displays number of "trainable parameters". Can't I estimate from this value? For example, take this, multiply by batch size, double for gradients etc?
PyTorch Lightning recently added a feature called "auto batch size", especially for this! It computes the max batch size that can fit into the memory of your GPU :)
More info can be found here.
Original PR: https://github.com/PyTorchLightning/pytorch-lightning/pull/1638
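The general idea behind such a search can be sketched in a few lines. This is not Lightning's implementation, just a minimal doubling search; fits is a hypothetical callable that tries one training step at the given batch size and returns False if it runs out of memory.

```python
def largest_fitting_batch_size(fits, start=1, limit=65536):
    """Double the batch size until fits(batch_size) fails; return the last size that worked."""
    best = None
    batch_size = start
    while batch_size <= limit and fits(batch_size):
        best = batch_size
        batch_size *= 2
    return best

# Stand-in for a real memory check: pretend anything up to 100 fits
print(largest_fitting_batch_size(lambda b: b <= 100))  # 64
```

In practice, fits would wrap a single training step in a try/except that catches the framework's out-of-memory error. The function returns None if even the starting size does not fit.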
No, it is not possible to do this automatically. So you need to go through a lot of trial and error to find an appropriate size if you want your batch to be as large as possible.
Stanford's CNN class provides some guidance on how to estimate the memory size, but all the suggestions relate to CNNs (I'm not sure what you are training).
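A back-of-the-envelope version of such an estimate might look like the sketch below. Note that parameter memory does not grow with batch size; activation memory does. The factors used here (three copies of the parameters for weights, gradients and one optimizer slot; activations doubled for the backward pass) are assumptions for illustration, real frameworks add workspace and fragmentation on top, and the function names are made up.

```python
def estimate_training_bytes(n_params, activations_per_sample, batch_size,
                            bytes_per_value=4, param_copies=3):
    """Rough training-memory estimate in bytes (float32 by default)."""
    param_bytes = n_params * param_copies * bytes_per_value
    activation_bytes = 2 * activations_per_sample * batch_size * bytes_per_value
    return param_bytes + activation_bytes

def max_batch_size(n_params, activations_per_sample, budget_bytes):
    """Largest batch size whose estimate stays within budget_bytes (0 if none fits).

    Assumes activations_per_sample > 0.
    """
    fixed = estimate_training_bytes(n_params, activations_per_sample, 0)
    per_sample = estimate_training_bytes(n_params, activations_per_sample, 1) - fixed
    if budget_bytes < fixed:
        return 0
    return (budget_bytes - fixed) // per_sample

# Hypothetical network: 1M parameters, 500k activation values per sample, 2 GiB budget
print(max_batch_size(1_000_000, 500_000, 2 * 1024**3))  # 533
```

Treat the result as an upper bound to start a trial-and-error search from, not as a guarantee.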
I think Salvador here means that it is not possible to analytically compute the best-suited batch size. However, as with all things in ML, it is just another hyperparameter that can be added to your grid search and determined automatically. Simply evaluate your model's loss or accuracy (however you measure performance) for several batch sizes, say some powers of 2 such as 64, 256, 1024, etc., and look for the best and most stable (least variable) result. Then keep using the best batch size found. Note that batch size can depend on your model's architecture, machine hardware, etc. For example, if you move your modeling from a local PC to some cloud compute engine (GCP, AWS, Azure,...), then the batch size which was too large for your PC's RAM becomes easily suitable for practically limitless RAM/CPU/GPU (mind the costs).

Low GPU usage by Keras / Tensorflow?

I'm using keras with tensorflow backend on a computer with a nvidia Tesla K20c GPU. (CUDA 8)
I'm training a relatively simple Convolutional Neural Network. During training I run the terminal program nvidia-smi to check the GPU use. As you can see in the following output, the GPU utilization commonly shows around 7%-13%.
My question is: during CNN training, shouldn't the GPU usage be higher? Is this a sign of a bad GPU configuration or usage by keras/tensorflow?
nvidia-smi output
It could be due to several reasons, but most likely you have a bottleneck when reading the training data. Once your GPU has processed a batch, it requires more data. Depending on your implementation, this can cause the GPU to wait for the CPU to load more data, resulting in lower GPU usage and also a longer training time.
Try loading all data into memory if it fits or use a QueueRunner which will make an input pipeline reading data in the background. This will reduce the time that your GPU is waiting for more data.
The Reading Data Guide on the TensorFlow website contains more information.
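The idea of a background input pipeline can be sketched with the standard library alone. This is not QueueRunner itself, just a minimal illustration of reading ahead in a separate thread while the consumer (the training loop) is busy; the function name is made up. In later TensorFlow versions the tf.data API, with its prefetch method, plays this role.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream

def prefetch(batches, buffer_size=4):
    """Yield items from `batches` while a background thread reads ahead."""
    buffer = queue.Queue(maxsize=buffer_size)

    def producer():
        try:
            for batch in batches:
                buffer.put(batch)  # blocks when the buffer is full
        finally:
            buffer.put(_DONE)      # always signal the end, even on error

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is _DONE:
            return
        yield item

# Any iterable of batches works; integers stand in for real batches here
print(list(prefetch(range(5), buffer_size=2)))  # [0, 1, 2, 3, 4]
```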
You should find the bottleneck:
On windows use Task-Manager> Performance to monitor how you are using your resources
On Linux use nmon, nvidia-smi, and htop to monitor your resources.
The most likely scenarios are:
If you have a huge dataset, take a look at the disk read/write rates; if you are accessing your hard disk frequently, you most probably need to change the way you are dealing with the dataset to reduce the number of disk accesses
Use the memory to pre-load everything as much as possible.
If you are using a restful API or any similar services, make sure that you do not wait much for receiving what you need. For restful services, the number of requests per second might be limited (check your network usage via nmon/Task manager)
Make sure you do not use swap space in any case!
Reduce the overhead of preprocessing by any means (e.g. using cache, faster libraries, etc.)
Play with the batch_size (however, it is said that higher values (>512) for batch size might have negative effects on accuracy)
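As one example of the caching suggestion in the list above: if preprocessing is deterministic, memoizing it means each example is processed once instead of once per epoch. In this sketch, preprocess is a hypothetical stand-in for your real preprocessing function.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def preprocess(example_id):
    """Stand-in for an expensive, deterministic preprocessing step."""
    return example_id * 2  # placeholder for real work

for epoch in range(3):
    batch = [preprocess(i) for i in range(1000)]

info = preprocess.cache_info()
print(info.misses, info.hits)  # 1000 2000
```

This only helps if the cached results fit in RAM and the preprocessing has no random augmentation.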
The reason may be that your network is "relatively simple". I had a MNIST network with 60k training examples.
with 100 neurons in 1 hidden layer, CPU training was faster and GPU utilization on GPU training was around 10%
with 2 hidden layers, 2000 neurons each, GPU was significantly faster(24s vs 452s on CPU) and its utilization was around 39%
I have a pretty old PC (24GB DDR3-1333, i7 3770k) but a modern graphic card(RTX 2070 + SSDs if that matters) so there is a memory-GPU data transfer bottleneck.
I'm not yet sure how much room for improvement is here. I'd have to train a bigger network and compare it with better CPU/memory configuration + same GPU.
I guess that for smaller networks it doesn't matter that much anyway because they are relatively easy for the CPU.
Measuring GPU performance and utilization is not as straightforward as for CPU or memory. A GPU is an extremely parallel processing unit and there are many factors. The GPU utilization number shown by nvidia-smi is the percentage of time during which at least one GPU multiprocessing group was active. If this number is 0, none of your GPU is being utilized, but if it is 100, that does not mean that the GPU is being used at its full potential.
These two articles have lots of interesting information on this topic:
https://www.imgtec.com/blog/a-quick-guide-to-writing-opencl-kernels-for-rogue/
https://www.imgtec.com/blog/measuring-gpu-compute-performance/
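If you want to log the utilization number during training instead of watching the terminal, nvidia-smi can print it in CSV form. A minimal sketch follows; the function names and dictionary keys are made up, and the parser assumes every field is a plain integer, which will not hold if the driver reports [N/A] for a field.

```python
import subprocess

QUERY = ["nvidia-smi",
         "--query-gpu=utilization.gpu,memory.used,memory.total",
         "--format=csv,noheader,nounits"]

def parse_gpu_stats(csv_text):
    """Parse one line per GPU: utilization %, memory used MiB, memory total MiB."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, used, total = (int(field) for field in line.split(","))
        stats.append({"util_percent": util,
                      "mem_used_mib": used,
                      "mem_total_mib": total})
    return stats

def query_gpu_stats():
    """Run nvidia-smi once and return the parsed statistics."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)
```

Calling query_gpu_stats() every few seconds from a background thread gives a utilization trace you can line up with epoch boundaries.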
Low GPU utilization might be due to the small batch size. Keras has a habit of occupying the whole memory size whether, for example, you use batch size x or batch size 2x. Try using a bigger batch size if possible and see if it changes.

TensorFlow RNN training 100% CPU while only using 60% GPU

I'm working on code that trains a relatively large RNN (128 cell LSTM and some added layers). The main process is maxing out a core on the CPU, and I'm wondering if this is normal or whether I can optimize it. During the training loop (session.run calls) it's using about 60-70% GPU load while using 100% CPU load on one core. Note that data sampling work is already being done concurrently on other cores, so it's just the updating of the model parameters. Is this regular for such applications in TensorFlow or should the CPU load be much lower, while using the full capacity of the GPU?
We don't have full documentation on it yet, but you can take a look at the profiling information to see if it gives you more of an idea of where the time is going:
https://github.com/tensorflow/tensorflow/issues/1824#issuecomment-225754659
I think an RNN cell has two inputs, and during training it must wait for both of them; in other words, it is not as easy to parallelise as a CNN. You can use a larger batch size to improve the GPU utilization rate, but that may cause other problems, such as those described in the paper On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima.
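The sequential dependency can be seen in a toy scalar recurrence. This is an illustration only, not an LSTM; the weights are arbitrary and the function name is made up.

```python
import math

def rnn_forward(inputs, w_x=0.5, w_h=0.9):
    """Toy scalar RNN: each hidden state depends on the previous one."""
    h = 0.0
    states = []
    for x in inputs:  # step t cannot start before step t-1 has finished
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

print(len(rnn_forward([1.0, 2.0, 3.0])))  # 3
```

The loop over time steps is inherently serial, so the work that can run in parallel at each step comes from the batch and hidden dimensions. That is why a larger batch size helps utilization here.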