I am running a deep-learning CNN model (4 convolutional layers and 3 fully-connected layers, written in Keras with TensorFlow as the backend) on two different machines.
I have 2 machines:
A: a GTX 960 graphics GPU with 2 GB memory, clock speed 1.17 GHz
B: a Tesla K40 compute GPU with 12 GB memory, clock speed 745 MHz
But when I run the CNN model on A:
Epoch 1/35
50000/50000 [==============================] - 10s 198us/step - loss: 0.0851 - acc: 0.2323
on B:
Epoch 1/35
50000/50000 [==============================] - 43s 850us/step - loss: 0.0800 - acc: 0.3110
The numbers are not even comparable. I am quite new to deep learning and to running code on GPUs. Could someone please help me understand why the numbers are so different?
Dataset: CIFAR-10 (32x32 RGB images)
Model batch size: 128
Model number of parameters: 1.2M
OS: Ubuntu 16.04
Nvidia driver version: 384.111
Cuda version: 7.5, V7.5.17
Please let me know if you need any more data.
Edit 1: (adding CPU info)
Machine A (GTX 960): 8 cores - Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
Machine B (Tesla K40c): 8 cores - Intel(R) Xeon(R) CPU E5-2637 v4 @ 3.50GHz
TL;DR: Measure again with a larger batch size.
Those results do not surprise me much. It's a common mistake to think that an expensive Tesla card (or a GPU for that matter) will automatically do everything faster. You have to understand how GPUs work in order to harness their power.
If you compare the base clock speeds of your devices, you will find that your Xeon CPU has the fastest one:
Nvidia K40c: 745MHz
Nvidia GTX 960: 1127MHz
Intel i7: 3400MHz
Intel Xeon: 3500MHz
This gives you a hint of the speeds at which these devices operate and a very rough estimate of how fast they can crunch numbers if they were doing only one thing at a time, that is, with no parallelization.
So as you see, GPUs are not fast at all (for some definition of fast), in fact they're quite slow. Also note how the K40c is in fact slower than the GTX 960.
However, the real power of a GPU comes from its ability to process a lot of data simultaneously! If you now check again at how much parallelization is possible on these devices, you will find that your K40c is not so bad after all:
Nvidia K40c: 2880 cuda cores
Nvidia GTX 960: 1024 cuda cores
Intel i7: 8 threads
Intel Xeon: 8 threads
Again, these numbers give you a very rough estimate of how many things these devices can do simultaneously.
Note: I am severely simplifying things: In absolutely no way is a CPU core comparable to a cuda core! They are very very different things. And in no way can base clock frequencies be compared like this! It's just to give an idea of what's happening.
So, your devices need to be able to process a lot of data in parallel in order to maximize their throughput. Luckily TensorFlow already does this for you: it will automatically parallelize all those heavy matrix multiplications for maximum throughput. However, this is only going to be fast if the matrices have a certain size. Your batch size is set to 128, which means that almost all of these matrices will have their first dimension set to 128. I don't know the details of your model, but if the other dimensions are not large either, then I suspect that most of your K40c is staying idle during those matrix multiplications.

Try to increase the batch size and measure again. You should find that larger batch sizes make the K40c faster in comparison with the GTX 960. The same should be true for increasing the model's capacity: increase the number of units in the fully-connected layers and the number of filters in the convolutional layers. Adding more layers will probably not help here. The output of the nvidia-smi tool is also very useful to see how busy a GPU really is.
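As a rough way to follow this advice, here is a hypothetical timing loop over increasing batch sizes; model, x_train and y_train are assumptions standing in for your compiled Keras model and the CIFAR-10 training arrays:

    import time

    # model, x_train, y_train are assumptions: a compiled Keras model and its training data
    for batch_size in (128, 256, 512, 1024):
        start = time.time()
        model.fit(x_train, y_train, batch_size=batch_size, epochs=1, verbose=0)
        print('batch size %4d: %.1f s per epoch' % (batch_size, time.time() - start))

Watching nvidia-smi while this runs should show the K40c's utilization climb as the batch size grows.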
Note however that changing the model's hyper-parameters and/or the batch size will of course have a huge impact on how well the model trains, and naturally you might also hit memory limitations.
Perhaps if increasing the batch size or changing the model is not an option, you could also try to train two models on the K40c at the same time to make use of the idle cores. However I have never tried this, so it might not work at all.
Related
I use a Xeon E5-1650 CPU (3.2 GHz, 6 cores, 12 threads) for training a TensorFlow model.
But training is so slow...
If I use a desktop computer with a typical CPU and two GeForce GTX 750 GPUs (2 GB each), will it be faster?
Using the GPUs will be faster. The only things to keep in mind are that the size of your model is then constrained by the memory of the GPUs, and that you have to choose the right combination of version numbers and drivers so that your GPU is supported.
I have Tensorflow 1.4 GPU version installed. Cuda8 is installed too.
I trained my pretty simple GAN network on MNIST data.
I have AMD FX 8320 CPU, 16Gb system memory and SSD hard drive.
It took about 17 seconds per epoch on GeForce 720 GPU with 1GB memory.
The training utilized about 25% of the GPU and 99% of its memory. The CPU load was pretty high, close to 100%.
Then I swapped in another video board with a GeForce 1050 Ti GPU and 4 GB memory. The GPU was loaded only to 5-6%, and its memory was utilized to 93%.
But I still got about 17 s per epoch and a high CPU load.
So maybe Tensorflow has some settings to utilize more GPU?
Or what is a cause of high CPU load and low GPU load?
If you are training a simple GAN network, it is fairly likely that your old GPU was not the bottleneck in the first place, so improving it had no effect. If the amount of work done per sess.run() call is very small, the overheads (executing your Python code, copying the input data to the GPU, starting and running the TensorFlow executor, scheduling all the operations on the GPU, etc.) can dominate your computation.
The only sure way of knowing what happens is to profile. You can take a look here https://www.tensorflow.org/performance/performance_guide as a starting point. The timeline tool it mentions can be fairly useful. See here for more details: Can I measure the execution time of individual operations with TensorFlow?.
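For reference, a minimal sketch of the timeline-based profiling mentioned above, assuming the TensorFlow 1.x low-level API; sess, train_op and feed_dict are assumptions standing in for your own session, training op and inputs:

    import tensorflow as tf
    from tensorflow.python.client import timeline

    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()
    sess.run(train_op, feed_dict=feed_dict,
             options=run_options, run_metadata=run_metadata)

    # Write a Chrome trace that can be opened at chrome://tracing
    trace = timeline.Timeline(step_stats=run_metadata.step_stats)
    with open('timeline.json', 'w') as f:
        f.write(trace.generate_chrome_trace_format())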
Agreed: for MNIST-sized datasets, there are probably other bottlenecks in the system, not the GPU. I ran TensorFlow side by side on two machines,
Intel i7 4600M with NVIDIA Quadro K1100M GPU and 12 GB RAM, which is a 4th Gen Haswell Intel machine, and
Intel i5 8300U with No Cuda GPU and 16GB of RAM.
Basically 8th Gen Kaby Lake Intel CPU vs 4th Gen Intel, and I got:
4th Gen Intel chip with NVIDIA GPU:
311.5 sec, 315.9 sec, 313.0 sec to complete all 10 epochs of an MNIST run
8th Gen Intel chip with no GPU:
252.7 sec, 243.5 sec, 254.9 sec
So I'm running 20% faster with no GPU, just a newer generation of Intel chip.
I'm using Keras with the TensorFlow backend on a computer with an NVIDIA Tesla K20c GPU (CUDA 8).
I'm training a relatively simple convolutional neural network; during training I run the terminal program nvidia-smi to check GPU use. As you can see in the following output, the GPU utilization commonly sits around 7%-13%.
My question is: during CNN training shouldn't the GPU usage be higher? Is this a sign of a bad GPU configuration or bad usage by Keras/TensorFlow?
nvidia-smi output
This could be due to several reasons, but most likely you have a bottleneck when reading the training data. Once your GPU has processed a batch, it requires more data. Depending on your implementation, this can cause the GPU to wait for the CPU to load more data, resulting in lower GPU usage and also a longer training time.
Try loading all data into memory if it fits, or use a QueueRunner, which builds an input pipeline that reads data in the background. This will reduce the time your GPU spends waiting for more data.
The Reading Data Guide on the TensorFlow website contains more information.
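As a hedged alternative to QueueRunner, here is a minimal background input pipeline using the tf.data API (available in TensorFlow 1.4+); x_train and y_train are assumed to be NumPy arrays that fit in memory:

    import tensorflow as tf

    dataset = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
               .shuffle(buffer_size=10000)
               .batch(128)
               .prefetch(1))  # prepare the next batch while the GPU works on the current one
    images, labels = dataset.make_one_shot_iterator().get_next()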
You should find the bottleneck:
On windows use Task-Manager> Performance to monitor how you are using your resources
On Linux use nmon, nvidia-smi, and htop to monitor your resources.
The most possible scenarios are:
If you have a huge dataset, take a look at the disk read/write rates; if you are accessing your hard disk frequently, you most probably need to change the way you are dealing with the dataset to reduce the number of disk accesses
Use memory to pre-load as much as possible (see the sketch after this list)
If you are using a REST API or any similar service, make sure you are not spending too much time waiting to receive what you need. For REST services, the number of requests per second might be limited (check your network usage via nmon/Task Manager)
Make sure you do not use swap space in any case!
Reduce the overhead of preprocessing by any means (e.g. using a cache, faster libraries, etc.)
Play with the batch_size (however, it is said that higher values (>512) for batch size might have negative effects on accuracy)
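A sketch of the pre-loading idea from the list above; the .npy file names and the model are assumptions, and the point is simply to touch the disk once and train entirely from RAM:

    import numpy as np

    x_train = np.load('x_train.npy')  # hypothetical pre-saved arrays
    y_train = np.load('y_train.npy')
    model.fit(x_train, y_train, batch_size=512, epochs=10)  # no disk reads during training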
The reason may be that your network is "relatively simple". I had an MNIST network with 60k training examples (a Keras sketch of both setups follows below).
With 100 neurons in 1 hidden layer, CPU training was faster and GPU utilization during GPU training was around 10%
With 2 hidden layers of 2000 neurons each, the GPU was significantly faster (24 s vs. 452 s on CPU) and its utilization was around 39%
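A minimal Keras sketch of the two setups described above; the layer sizes come from the answer, while everything else (optimizer, loss, activation choices) is an assumption:

    from keras.models import Sequential
    from keras.layers import Dense

    def mnist_mlp(hidden_sizes):
        # Fully-connected MNIST classifier with the given hidden layer sizes
        model = Sequential()
        model.add(Dense(hidden_sizes[0], activation='relu', input_dim=784))
        for size in hidden_sizes[1:]:
            model.add(Dense(size, activation='relu'))
        model.add(Dense(10, activation='softmax'))
        model.compile(optimizer='adam', loss='categorical_crossentropy',
                      metrics=['accuracy'])
        return model

    small = mnist_mlp([100])         # CPU wins; GPU utilization stays around 10%
    large = mnist_mlp([2000, 2000])  # enough work per batch for the GPU to pull ahead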
I have a pretty old PC (24 GB DDR3-1333, i7 3770k) but a modern graphics card (RTX 2070, plus SSDs if that matters), so there is a memory-to-GPU data transfer bottleneck.
I'm not yet sure how much room for improvement is here. I'd have to train a bigger network and compare it with better CPU/memory configuration + same GPU.
I guess that for smaller networks it doesn't matter that much anyway because they are relatively easy for the CPU.
Measuring GPU performance and utilization is not as straightforward as for the CPU or memory. The GPU is an extremely parallel processing unit and there are many factors involved. The GPU utilization number shown by nvidia-smi is the percentage of time during which at least one GPU multiprocessing group was active. If this number is 0, it is a sign that your GPU is not being utilized at all, but if this number is 100, it does not mean that the GPU is being used to its full potential.
These two articles have lots of interesting information on this topic:
https://www.imgtec.com/blog/a-quick-guide-to-writing-opencl-kernels-for-rogue/
https://www.imgtec.com/blog/measuring-gpu-compute-performance/
Low GPU utilization might be due to a small batch size. Keras has a habit of occupying all of the GPU memory regardless of whether, for example, you use batch size x or batch size 2x. Try using a bigger batch size if possible and see if the utilization changes.
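If you want nvidia-smi's memory column to reflect what the model actually needs, a common sketch (assuming TensorFlow 1.x with standalone Keras) is to enable memory growth instead of letting TensorFlow grab all GPU memory up front:

    import tensorflow as tf
    from keras import backend as K

    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True  # allocate GPU memory as needed, not all at once
    K.set_session(tf.Session(config=config))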
I am trying to calibrate my expectations around a single laptop's ability to train a neural network. I am using TensorFlow and Keras, and after roughly 10 minutes it crashes. I've seen kill signal 9 / exit code 137, and I'm wondering if this is due to insufficient memory. Other times, when one-hot encoding using np_utils.to_categorical(), I've seen the word MemoryError in the console, and that's it; my script crashes. This is just trying to transform the outputs into what a neural net expects before it even runs.
I have 6400 inputs and 1500 outputs and a small hidden layer of 100 nodes. Batch size 128.
That's it. It's not even deep. It crashes whether I use an NVIDIA GPU or a 4-core CPU. For you pros, is my network too big to train on my system (i7, 4 cores, 16 GB RAM, NVIDIA GT 750M, compute capability 3.0)? Is my neural network considered a large one? I have 3 million samples, btw.
1) How do I estimate the amount of memory required for my network? Is it 6400 (# inputs) * 1500 (#outputs) * 4 bytes (per parameter) = 38.4 gb? Can I see how much memory is being used in real time on a mac somewhere? I've used activity monitor and the memory pressure gauge is normal.
2) GPUs typically are maxing out at 8gb-12gb of RAM, whereas CPUs on desktops could easily have 64 gb. So if the memory requirements of my network exceed 8gb of RAM, would it be impossible to train on a single GPU?
3) what is the difference, especially memory wise, between batch_size and batch_training?
Thank you!
Your multiplication was correct, with the exception that you are dealing with megabytes and not gigabytes. The actual requirement is 6400*100*4 + 100*1500*4, which is roughly 3 MB if you use the default float32. You multiply the layer sizes of two subsequent layers together because every neuron is connected to every neuron in the subsequent layer. The activation memory is then also multiplied by the batch size. This is why convolutional layers are used to train deep networks.
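A quick way to sanity-check this number is to let Keras count the parameters itself; a minimal sketch of the network described in the question:

    from keras.models import Sequential
    from keras.layers import Dense

    model = Sequential([
        Dense(100, activation='relu', input_dim=6400),  # 6400*100 weights + 100 biases
        Dense(1500, activation='softmax'),              # 100*1500 weights + 1500 biases
    ])
    params = model.count_params()                       # 791,600 parameters
    print('%d params, ~%.1f MB as float32' % (params, params * 4 / 1e6))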
For the GPU, I use nvidia-smi to monitor memory usage on Linux. A Google search gave me this for Mac: https://phvu.net/2015/03/30/nvidia-smi-on-macos/. If the memory requirements exceed the GPU memory, you cannot train on the GPU. You could train on a CPU, but that will take ages.
There are multiple ways to train with a large training set. Normally generators are used to train on batches, which means loading only the parts of the training set you actually need (https://keras.io/getting-started/faq/#how-can-i-use-keras-with-datasets-that-dont-fit-in-memory).
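A hedged sketch of such a generator, assuming the training set has been split into hypothetical .npz shards on disk and that model is a compiled Keras model:

    import numpy as np

    def batch_generator(file_paths, batch_size=128):
        # Yield (x, y) batches forever, loading one shard into memory at a time
        while True:
            for path in file_paths:
                shard = np.load(path)
                x, y = shard['x'], shard['y']
                for i in range(0, len(x), batch_size):
                    yield x[i:i + batch_size], y[i:i + batch_size]

    # train_files is a hypothetical list of shard paths; steps_per_epoch = samples // batch_size
    model.fit_generator(batch_generator(train_files),
                        steps_per_epoch=3000000 // 128, epochs=10)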
The memory requirements of your neural network depend on more than the size of the network or the number of parameters. For calculating the memory footprint of a neural network, one document I always go back to is the Stanford CS231n Convolutional Neural Networks for Visual Recognition course notes. Please take a look at the portion where they work out the memory requirements for each and every layer of the network.
To add to that, the batch size (the number of inputs per batch) is a crucial factor in deciding memory usage. For example, on a newer NVIDIA P100 GPU I can go as high as 2048 images per batch if I train a CIFAR-10 model, and fewer than 512 or 256 images if I train AlexNet on the ImageNet dataset. The input size matters a lot, as does the batch size, since the GPU memory needs to accommodate the batch of inputs.
One way to test which batch size works is to run nvidia-smi and see how much memory is used. Since doing that every now and then is boring, I usually run watch nvidia-smi on my Linux machine. My Mac does not have an NVIDIA GPU installed, so I seldom use these tricks there. When I want to, I write a quick bash one-liner like this:
while true; do nvidia-smi; sleep 0.5; clear; done
You can also install watch on a Mac via MacPorts (port install watch).
Also, two of my most favorite tools of all time are htop and dstat.
htop gives you a much better graphical interface to the famous top command in Linux. It gives you real-time information regarding your memory and processor usage, along with the different processes. If you give sudo access to htop, you can change the niceness and other parameters directly from the interface.
dstat gives you real time information about your I/O. In most cases, I will add two flags -d and -n to specify disk and network usage only.
Fortunately, htop can be brew installed on Mac by running:
brew install htop
dstat on the other hand is not directly available. Please look into ifstat or iostat for similar functionalities.
A screenshot of the htop command on a Mac.
I am using an Nvidia DIGITS box with a GPU (Nvidia GeForce GTX Titan X) and TensorFlow 0.6 to train a neural network, and everything works. However, when I check the Volatile GPU-Util using nvidia-smi -l 1, I notice that it's only 6%, and I think most of the computation is on the CPU, since the process running TensorFlow has about 90% CPU usage. The result is that the training process is very slow. I wonder if there are ways to make full use of the GPU instead of the CPU to speed up training. Thanks!
I suspect you have a bottleneck somewhere (like in this github issue) -- you have some operation which doesn't have GPU implementation, so it's placed on CPU, and the GPU is idling because of data transfers. For instance, until recently reduce_mean was not implemented on GPU, and before that Rank was not implemented on GPU, and it was implicitly being used by many ops.
At one point, I saw a network from fully_connected_preloaded.py being slow because there was a Rank op that got placed on CPU, and hence triggering the transfer of entire dataset from GPU to CPU at each step.
To solve this I would first recommend upgrading to 0.8 since it had a few more ops implemented for GPU (reduce_prod for integer inputs, reduce_mean and others).
Then you can create your session with log_device_placement=True and see if there are any ops placed on CPU or GPU that would cause excessive transfers per step.
There are often ops in the input pipeline (such as parse_example) which don't have GPU implementations; I sometimes find it helpful to pin the whole input pipeline to the CPU using a with tf.device("/cpu:0"): block.
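Putting the last two suggestions together, a minimal sketch for TensorFlow 1.x; build_input_pipeline is a hypothetical stand-in for your own parsing/decoding/batching code:

    import tensorflow as tf

    # Log every op's placement so CPU-only ops (and the transfers they cause) stand out
    sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))

    # Keep the whole input pipeline (e.g. tf.parse_example) on the CPU
    with tf.device("/cpu:0"):
        features, labels = build_input_pipeline()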