As the question already suggests, I am new to deep learning. I know that training a model will be slow without a GPU. If I am willing to wait, will it be OK to use only a CPU?
Many operations which are performed in computing deep learning (and neural networks in general) can be run in parallel, meaning they can be calculated independently then aggregated later. This is, in part, because most of the operations are on vectors.
A typical consumer CPU has between 4 and 8 cores, and hyperthreading allows them to be treated as 8 or 16 respectively. Server CPUs can have between 4 and 24 cores, with 8 to 48 threads respectively. Additionally, most modern CPUs have SIMD (single instruction, multiple data) extensions which allow them to perform vector operations in parallel on a single thread. Depending on the data type you're working with, an 8-core CPU can perform 8 * 2 * 4 = 64 to 8 * 2 * 8 = 128 vector calculations at once.
Nvidia's new GTX 1080 Ti has 3584 CUDA cores, which essentially means it can perform 3584 vector calculations at once (hyperthreading and SIMD don't come into play here). That's 28 to 56 times more operations at once than an 8-core CPU. So, whether you're training a single network or several to tune meta-parameters, it will probably be significantly faster on a GPU than on a CPU.
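To make the arithmetic above explicit, here is a small sketch that recomputes the figures. The variable names are mine, and the numbers are the rough upper bounds quoted in this answer, not measured throughput.

```python
# Rough upper bound on simultaneous vector calculations, using the figures above.
cpu_cores = 8
threads_per_core = 2      # hyperthreading
simd_lanes = (4, 8)       # lanes per thread, depending on the data type
gpu_cuda_cores = 3584     # GTX 1080 Ti

# Parallel calculations on the CPU for each SIMD width.
cpu_parallel = [cpu_cores * threads_per_core * lanes for lanes in simd_lanes]

# How many times more calculations the GPU can run at once.
ratios = [gpu_cuda_cores / p for p in cpu_parallel]

print(cpu_parallel)  # [64, 128]
print(ratios)        # [56.0, 28.0]
```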
Depending on what you are doing, it might take a lot longer. I have had 20x speedups by using a GPU. If you read some computer vision papers, they train their networks on ImageNet for about 1-2 weeks. Now imagine if that took 20x longer...
Having said that: There are much simpler tasks. For example, for my HASY dataset you can train a reasonable network without a GPU in probably 3 hours. Similar small datasets are MNIST, CIFAR-10, CIFAR-100.
The computationally intensive part of a neural network is its many matrix multiplications. How do we make them faster? By doing all the operations at the same time instead of one after the other. This is, in a nutshell, why we use a GPU (graphics processing unit) instead of a CPU (central processing unit).
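As a small illustration of why matrix multiplication parallelizes well, here is a sketch in plain Python where every output row is computed as an independent task. The function names are hypothetical, and Python threads will not actually speed this up; the point is only that no row depends on another, which is what a GPU exploits.

```python
from concurrent.futures import ThreadPoolExecutor

def row_times_matrix(row, b):
    """Compute one output row of A @ B; it needs only this one row of A."""
    columns = list(zip(*b))
    return [sum(x * y for x, y in zip(row, col)) for col in columns]

def matmul_sequential(a, b):
    # One row after the other.
    return [row_times_matrix(row, b) for row in a]

def matmul_parallel(a, b, workers=4):
    # Each row is an independent task; results are gathered in the original order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda row: row_times_matrix(row, b), a))

a = [[1, 2], [3, 4], [5, 6]]
b = [[7, 8], [9, 10]]
print(matmul_parallel(a, b) == matmul_sequential(a, b))  # True
```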
Google used to have a powerful system, which they had specially built for training huge nets. This system cost $5 billion and consisted of multiple clusters of CPUs.
A few years later, researchers at Stanford built a system with the same computational power to train their deep nets, this time using GPUs. They reduced the cost to $33K.
Source: https://www.analyticsvidhya.com/blog/2017/05/gpus-necessary-for-deep-learning/
Deep learning is about building a mathematical model of reality, or of some part of reality, for a specific use. You collect a lot of training data from the real world and use it to train the model, so that it can predict outcomes when you give it new data as input. This training needs a lot of data and a lot of computation, with many computationally heavy operations. That is why companies such as Nvidia, which traditionally made gaming GPUs for graphics, now get a large part of their revenue from AI and machine learning. Scientists who want to train their models, and companies like Google and Facebook, currently use GPUs to train their ML models.
If you ask this question you probably need a GPU/TPU (Tensor Processing Unit).
You can get one for "free" with Google Colab, which offers cloud GPUs. You can start working with your Google account in a Jupyter notebook: https://colab.research.google.com/notebooks/intro.ipynb
Kaggle (a Google-owned data science competition site) also lets you create Jupyter notebooks with a GPU, but only in limited cases:
Notebooks: https://www.kaggle.com/kernels
Documentation: https://www.kaggle.com/docs/notebooks
I am implementing fast DNN model training using knowledge distillation, as illustrated in the figure below, to run the teacher and student models in parallel.
I checked some popular repos like NervanaSystems/distiller and peterliht/knowledge-distillation-pytorch. They execute the forward operations of the student and teacher models step by step, i.e., not in parallel on different devices (GPU or CPU).
I am trying to speed up this training process to run the 2 models at the same time using multiple devices (e.g., loading one model on CPU and not interrupting the GPU training of the other model).
What is the proper way to run 2 models in parallel? Can I use Python multiprocessing library to start 2 processes for the 2 models, i.e., loading 2 model instances and running forward()? I am using MXNet but this is a general question for all ML frameworks.
Edit:
My plan is to put a light-weight pre-trained teacher model on CPU which only runs forward pass with frozen parameters.
The student model is a large model to be trained on GPU (distributedly).
This task is not for model compression.
I suppose moving a light task (teacher's forward pass) to CPU can increase the overlap and make this pipeline faster.
The idea is from a workshop paper: Infer2Train: leveraging inference for better training of deep networks.
"I am trying to speed up this training process to run the 2 models at the same time using multiple devices"
I doubt that would bring any speedup, especially in the case of:
"(e.g., loading one model on CPU and not interrupting the GPU training of the other model)."
because a deep learning pipeline already uses the CPU, possibly several cores (for data loading, and also for receiving and gathering metrics, etc.).
Furthermore, a CPU is rather inefficient for neural network training compared with a GPU/TPU, unless you have an architecture tailored to CPUs (something like MobileNet). If you were to train the student on the CPU, you might significantly slow down the pipeline elements of the teacher.
What is the proper way to run 2 models in parallel?
Again, it depends on the model, but it would be best to use 2 GPUs for training and split the CPU cores between them for the other tasks. In your case you would have to synchronize the teacher and student predictions across the two devices, though.
Can I use Python multiprocessing library to start 2 processes for the 2 models, i.e., loading 2 model instances and running forward()?
PyTorch provides primitives (e.g. its own multiprocessing wrapper, Futures, etc.) which could possibly be used for that. I am not sure about MXNet or similar frameworks.
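To show the general shape of overlapping the two forward passes and then synchronizing them, here is a framework-free sketch using Python's standard concurrent.futures. The functions teacher_forward, student_forward and distillation_loss are hypothetical stand-ins, not MXNet or PyTorch calls, and whether this overlap actually saves time depends on your framework and hardware.

```python
from concurrent.futures import ThreadPoolExecutor

def teacher_forward(batch):
    # Stand-in for the frozen teacher's forward pass (hypothetical).
    return [2.0 * x for x in batch]

def student_forward(batch):
    # Stand-in for the student's forward pass (hypothetical).
    return [1.5 * x for x in batch]

def distillation_loss(student_out, teacher_out):
    # Mean squared difference between student and teacher outputs.
    return sum((s - t) ** 2 for s, t in zip(student_out, teacher_out)) / len(student_out)

def train_step(batch, pool):
    teacher_future = pool.submit(teacher_forward, batch)  # starts in the background
    student_out = student_forward(batch)                   # runs in the meantime
    teacher_out = teacher_future.result()                  # synchronization point
    return distillation_loss(student_out, teacher_out)

with ThreadPoolExecutor(max_workers=1) as pool:
    loss = train_step([1.0, 2.0, 3.0], pool)
print(loss)
```

The call to result() is the synchronization mentioned above: the student cannot compute its loss until the teacher's predictions for the same batch have arrived.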
I am working on CNN models which are intended to predict a protein's structure from its amino acid sequence. I am implementing my CNNs in Keras. The Keras API is the one that comes bundled with TensorFlow 1.4.0, so obviously TensorFlow is my backend. I have installed the GPU version of TensorFlow, and I have verified that the GPU is being used. My GPU is somewhat older, an Nvidia GTX 760.
When I perform 3X cross-validation to help select architectures and hyperparameters, I have 50K examples in my training folds and 25K samples in my validation folds. These are decently large data sets, however they're small in comparison to the RAM available in my computer (16 GB) or on my GPU (2 GB). Fully unpacked and expressed as float32 values, with redundancy introduced because of sliding windows, all the folds taken together, input plus target values, occupies 316 MB. I have pre-calculated my folds, and saved files of each fold to disk. When I experiment with architectures and hyperparameters, the same folds are being used in every trial.
I started with networks containing a single hidden layer to see what I could achieve, and then switched to two hidden layers. I used a fixed batch size of 64 for all of my early experiments. Training proceeded quickly enough that I didn't concern myself with speed. Performing a 3X cross-validation for a given architecture typically took about 12 minutes.
But in the last experiment that I did with two-layer networks, I decided to start investigating the effect of batch size. I learned that smaller batch sizes gave me better results, up to a point. Batch sizes of 8 were the smallest ones that I could count on not to crash. My loss values will occasionally flip to NaN with batch sizes of 4, and they will frequently flip to NaN with batch sizes of 1 or 2. After that occurs, the network becomes untrainable. I am aware of the possibility of gradient instability. I think I was getting some.
So why not just use batch sizes of 8 and keep going? The problem is speed. Using two hidden layers, batches of eight took me approximately 35 minutes to cross-validate. Batches of 64, as I mentioned above, took one third that much time. My first experiments with three hidden layers have taken 45 to 65 minutes per trial. And I want to investigate potentially hundreds of architectures and hyperparameters, using still deeper networks. With small batches, I can see that the batch-by-batch progress bar in Keras progresses more slowly. I can see much longer pauses when an epoch ends.
Yes, I can upgrade my GPU to a 10 series. I think that will only double my throughput at most? Yes, I can rent GPU time in the cloud. Eventually I might do that. But if my software is inefficient, I definitely don't want to set it loose in the cloud to burn my money.
It is my understanding (please correct me if I am wrong) that when the GPU is used in a normal TF / Keras workflow, each individual batch is sent separately from the CPU to the GPU. If I am training 50 networks in a 3X cross-validation scheme, this would mean that I'm sending the same data to my GPU 150 times. As I mentioned earlier, all my data occupies at most 316 MB, about 15% of the RAM available on the GPU. Can I devise a workflow which sends this 316 MB to the GPU once, and if so, will that have a useful impact on my throughput? Intuitively, it feels like it should.
Are there other bottlenecks I should be thinking about? Is there a way to profile TF or Keras operations?
Thanks for any advice you may have!
Okay. I know that you're more concerned about throughput from Keras and your hardware, but there are a few things I'd like to mention here:
smaller batch sizes gave me better results
Given your case, where your data is not very large, and assuming you're training for a fixed number of epochs (say 5), training with a smaller batch size is naturally expected to give you a slightly better result, as it means a higher number of back-prop steps overall than with a larger batch size. If you're training for a fixed number of training steps instead, I don't know why this is happening.
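To make the step-count argument concrete, here is a small sketch using the fold size from the question (50K training examples). The function name is just for illustration.

```python
import math

def steps_per_epoch(num_examples, batch_size):
    """Number of weight updates (back-prop steps) in one pass over the data."""
    return math.ceil(num_examples / batch_size)

train_examples = 50_000  # size of one training fold, from the question
for batch_size in (64, 8):
    print(batch_size, steps_per_epoch(train_examples, batch_size))
# 64 782
# 8 6250
```

With a batch size of 8 the network gets about 8 times as many updates per epoch as with 64, and each update also carries its own fixed overhead, which is consistent with the slower epochs reported in the question.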
loss values will occasionally flip to NaN with batch sizes of 4
Again, I'm assuming you're using batch normalization here, with CNNs. With BN, it's not recommended to use a small batch size like 2 or 4 (or even 8). One probable reason for the NaNs with small batches is that the current batch has low variance; if the epsilon value is also too small, you end up with very small values that can lead to numerical instability later on. More generally, this might be a case of gradient instability, as you mentioned. Consider using gradient clipping to see if it helps.
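As a rough illustration of what gradient clipping does, here is a framework-free sketch of clipping by global norm. The helper name is hypothetical; in Keras, the optimizers accept clipnorm and clipvalue arguments for a similar purpose.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale the gradient vector so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)          # already small enough, leave unchanged
    scale = max_norm / norm
    return [g * scale for g in grads]

# A gradient of norm 5 is scaled down to norm 1; its direction is preserved.
clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(clipped)
```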
GPU Workflow
Here, I assume that you have only 1 GPU. Unfortunately, you can't parallelise on a single GPU. To clarify, you shouldn't be concerned about the size of your data relative to GPU RAM. In most single-GPU cases, the current batch stays on the CPU and the GPU only takes up the operations. Rather, you should be concerned about the size of the parameters that the GPU would be computing. Since the operations for your 1-layer and 3-layer experiments differ a lot, I don't think it's possible, as you can't place multiple ops on the same device simultaneously. The best option for you here would be to use a larger batch size (not too large, as this would reduce the number of back-prop steps when training for a fixed number of epochs), so that you cover more data in a single go.
Just a tip for hyper-parameter tuning: you can consider using Highway CNNs. These are inspired by the gating mechanism of LSTMs: you specify a large number of hidden layers and the network figures out by itself how to control the information flow among the layers. In short, this would practically eliminate the effort of tuning the depth of the network, allowing you to tune other hyper-parameters such as the learning rate or filter sizes.
I hope at least some of this is relevant and helpful to you ;)
Suppose I can train with sample size N, batch size M and network depth L on my GTX 1070 card with TensorFlow. Now suppose I want to train with a larger sample 2N and/or a deeper network 2L, and I get an out-of-memory error.
Will plugging in additional GPU cards automatically solve this problem (supposing that the total memory of all GPU cards is sufficient to hold the batch and its gradients)? Or is it impossible with pure TensorFlow?
I've read that there are Bitcoin or Ethereum miners who build mining farms with multiple GPU cards, and that such a farm will mine faster.
Will a mining farm also perform better for deep learning?
Will plugging additional GPU cards automatically solve this problem?
No. You have to change your TensorFlow code to explicitly compute different operations on different devices (e.g. compute the gradients over a single batch on every GPU, then send the computed gradients to a coordinator that accumulates them and updates the model parameters by averaging these gradients).
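The scheme in parentheses can be sketched without any framework. The example below is only an illustration with hypothetical names: a one-parameter model y = w * x, two simulated "devices" that each compute a gradient on their own shard of the batch, and a coordinator that averages the gradients and updates w. It assumes the batch size is divisible by the number of devices. In real TensorFlow code the per-shard computations would be placed on different GPUs.

```python
def shard_gradient(w, shard):
    # Gradient of the mean squared error for y = w * x on one device's shard.
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(w, batch, num_devices, lr):
    size = len(batch) // num_devices
    shards = [batch[i * size:(i + 1) * size] for i in range(num_devices)]
    grads = [shard_gradient(w, s) for s in shards]  # one per device, could run concurrently
    avg_grad = sum(grads) / len(grads)              # coordinator averages the gradients
    return w - lr * avg_grad                        # single shared parameter update

batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # samples of y = 2x
w = 0.0
for _ in range(50):
    w = data_parallel_step(w, batch, num_devices=2, lr=0.05)
print(w)  # approaches 2.0
```

With equally sized shards, the averaged gradient equals the gradient over the whole batch, so the result matches single-device training on the full batch.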
Also, TensorFlow is flexible enough to let you specify different operations for every device (or for different remote nodes, it's the same).
You could do data augmentation on a single computational node and let the others process the data without applying this function. You can execute certain operations on one device or on a set of devices only.
it is impossible with pure tensorflow?
It's possible with tensorflow, but you have to change the code you wrote for a single train/inference device.
I've read that there are Bitcoin or Ethereum miners who build mining farms with multiple GPU cards, and that such a farm will mine faster.
Will mining farm also perform better for deep learning?
Blockchains that use PoW (Proof of Work) require solving a difficult problem with a brute-force-like approach (they compute lots of hashes with different inputs until they find a valid hash).
That means that if your single GPU can guess 1000 hashes/s, 2 identical GPUs can guess 2 x 1000 hashes/s.
The computations the GPUs are doing are completely uncorrelated: the data produced by GPU:0 is not used by GPU:1, and there are no synchronization points between the computations. This means that the task one GPU does can be executed in parallel by another GPU (obviously with different inputs per GPU, so the devices compute hashes to solve different problems given by the network).
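Here is a toy sketch of why this kind of search scales so easily. It is not a real mining algorithm: the block string, the "00" prefix and the nonce ranges are made up for illustration. Two workers scan disjoint ranges and never exchange data, so their results can simply be concatenated.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def search(block, start, stop, prefix="00"):
    """Return all nonces in [start, stop) whose SHA-256 digest starts with prefix."""
    hits = []
    for nonce in range(start, stop):
        digest = hashlib.sha256(f"{block}{nonce}".encode()).hexdigest()
        if digest.startswith(prefix):
            hits.append(nonce)
    return hits

# Two workers, disjoint nonce ranges, no communication between them.
ranges = [(0, 5000), (5000, 10000)]
with ThreadPoolExecutor(max_workers=2) as pool:
    parts = list(pool.map(lambda r: search("block", *r), ranges))
parallel_hits = parts[0] + parts[1]

# Same answer as one worker scanning the whole range.
print(parallel_hits == search("block", 0, 10000))  # True
```

Training a network is different: every step depends on the parameters produced by the previous step, so the devices have to synchronize.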
Back to TensorFlow: once you have modified your code to work with several GPUs, you could train your network faster (in short, because you're using bigger batches).
I trained a neural network using a GPU (1080 Ti). The training speed on the GPU is far better than on the CPU.
Now I want to serve this model using TensorFlow Serving. I am interested to know whether using a GPU in the serving process has the same impact on performance.
Since training works on batches but inference (serving) handles asynchronous requests, do you suggest using a GPU when serving a model with TensorFlow Serving?
You still need to do a lot of tensor operations on the graph to predict something, so a GPU still provides a performance improvement for inference. Take a look at this Nvidia paper; they have not tested on TF, but it is still relevant:
"Our results show that GPUs provide state-of-the-art inference performance and energy efficiency, making them the platform of choice for anyone wanting to deploy a trained neural network in the field. In particular, the Titan X delivers between 5.3 and 6.7 times higher performance than the 16-core Xeon E5 CPU while achieving 3.6 to 4.4 times higher energy efficiency."
The short answer is yes, you'll get roughly the same speedup from running on the GPU after training, with a few minor qualifications.
In training you're running 2 passes over the data, all of which happens on the GPU. During feedforward inference you're doing less work, so relative to the computation, more time is spent transferring data to GPU memory than in training. This is probably a minor difference though. And you can now load the GPU asynchronously if that's an issue (https://github.com/tensorflow/tensorflow/issues/7679).
Whether you'll actually need a GPU for inference depends on your workload. If your workload isn't overly demanding, you might get away with using the CPU anyway; after all, the computation workload is less than half, per sample. So consider the number of requests per second you'll need to serve, and test whether you overload your CPU to achieve that. If you do, time to get the GPU out!
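A back-of-the-envelope check along these lines could look like the sketch below. All numbers are hypothetical placeholders; you would replace the latency with what you actually measure on your CPU, and the helper names are mine.

```python
def max_requests_per_second(latency_ms, workers):
    """Upper bound on throughput if each worker handles one request at a time."""
    return workers * 1000 / latency_ms

def cpu_is_enough(target_rps, latency_ms, workers):
    return max_requests_per_second(latency_ms, workers) >= target_rps

# Hypothetical example: 50 ms per prediction on the CPU, 4 worker threads.
capacity = max_requests_per_second(latency_ms=50, workers=4)
print(capacity)                           # 80.0
print(cpu_is_enough(60, 50, workers=4))   # True
print(cpu_is_enough(200, 50, workers=4))  # False
```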
I'm fairly new to TensorFlow and ML in general, and am wondering what strategies I can use to increase the performance of an application I am building.
My app is using the TensorFlow C++ interface, with a source-compiled TF 0.11 libtensorflow_cc.so (built with bazel build -c opt --copt=-mavx and optionally adding --config=cuda) for either AVX or AVX + CUDA on Mac OS X 10.12.1, on a MacBook Pro 2.8 GHz Intel Core i7 (2 cores 8 threads) with 16GB RAM and an Nvidia 750m with 2GB VRAM.
My application is using Inception V3 model and pulling feature vectors from pool_3 layer. I'm decoding video frames via native API's and passing those in memory buffers to the C++ interface for TF and running them into a session.
I'm not currently batching, but I am caching my session and re-using it for each individual decoded frame / tensor submission. I've noticed that CPU and GPU performance are about the same, taking about 40 to 50 seconds to process 222 frames, which seems very slow to me. I've confirmed CUDA is being invoked and loaded, and the GPU is functioning (or appears so).
Some questions:
In general what should I expect for reasonable performance time wise of TF doing a frame of Inception on a consumer laptop?
How much of a difference does batching make for these operations? For tensors of 1x299x299x3, I imagine I am spending more time waiting on PCI transfers than waiting for meaningful work from the GPU?
If so, is there a good example of batching under C++ for InceptionV3?
Are there operations that cause additional CPU->GPU synchronization that might otherwise be avoided?
Is there a way to ensure my sessions / graphs share resources ? Can I use nested scopes somehow in this manner? I couldn't quite get that to work but likely missed something.
Any good documentation of general strategies for things to do / avoid?
My code is below:
https://github.com/Synopsis/Synopsis/blob/TensorFlow/Synopsis/TensorFlowAnalyzer/TensorFlowAnalyzer.mm
Thank you very much
For reference, OpenCV analysis using perceptual hash, histogram, dense optical flow, sparse optical flow for point tracking, and simple saliency detection takes 4 to 5 seconds for the same 222 frames using CPU or CPU + OpenCL.
https://github.com/Synopsis/Synopsis/tree/TensorFlow/Synopsis/StandardAnalyzer
Answering your last question first: yes, there is documentation about performance optimization:
The TensorFlow Performance Guide
The TensorFlow GPU profiling hints
Laptop performance is highly variable, and TF isn't particularly optimized for laptop GPUs. The numbers you're getting (222 frames in 40-50 seconds, or roughly 5 fps) don't seem crazy on a laptop platform, using the 2016 version of TensorFlow with Inception. With some of the performance improvements outlined in the performance guide above, that should probably be doubled in late 2017.
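For reference, the frame rate follows directly from the numbers in the question; the small sketch below just does the division (the variable names are mine).

```python
frames = 222
for seconds in (40, 50):
    fps = frames / seconds
    print(f"{frames} frames in {seconds} s = {fps:.2f} fps")
# 222 frames in 40 s = 5.55 fps
# 222 frames in 50 s = 4.44 fps
```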
For batching, yes - the newer example inception model code allows a variable batch size at inference time. This is mostly about whether the model itself was defined to handle a batch size, which is something improved since 2016.
Batching for inference will make a pretty big difference on GPU. Whether it helps on CPU depends a lot -- for example, if you build with MKL-DNN support, batching should be considered mandatory, but basic TensorFlow may not benefit as much.
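The grouping itself is simple. As a language-neutral sketch (written in Python rather than the C++ of the question, with a hypothetical helper name and integers standing in for decoded 299x299x3 frames), batching the 222 frames could look like this. Each batch would then be stacked into a single Nx299x299x3 tensor and run through the session once, instead of N separate 1x299x299x3 runs.

```python
def batched(frames, batch_size):
    """Yield consecutive lists of at most batch_size frames."""
    for start in range(0, len(frames), batch_size):
        yield frames[start:start + batch_size]

frames = list(range(222))            # stand-ins for decoded video frames
batches = list(batched(frames, 32))  # batch size 32 is an arbitrary choice

print(len(batches))      # 7 session runs instead of 222
print(len(batches[-1]))  # 30 frames in the final, partial batch
```

This only works if the model was exported with a variable batch dimension, as noted above.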