speeden up Haar Cascade training process - gpu

Training a haar cascade takes a lot of time and also the entire training period depends on the machine.
What are the factors that contribute to speeding up the process?
Does having more RAM and a GPU help?
Does haar cascade training have GPU support like tensorflow does?

Opencv documentation here state that
LBP features yield integer precision in contrast to HAAR features, yielding floating point precision, so both training and detection with LBP are several times faster then with HAAR features. Regarding the LBP and HAAR detection quality, it mainly depends on the training data used and the training parameters selected. It's possible to train a LBP-based classifier that will provide almost the same quality as HAAR-based one, within a percentage of the training time.
So while training, instead of using old opencv_haartraining tool use opencv_traincascade tool with -featureType LBP as a parameter(default is HAAR)
Also, you can use -precalcValBufSize and -precalcIdxBufSize parameters to assign specific amount of memory for training. The more memory you assign the faster the training process, however keep in mind that -precalcValBufSize and -precalcIdxBufSize combined should not exceed your available system memory.

Related

Is it possible to estimate the time needed to train a machine learning model given a size of data and hardware specification?

I am planning to make small Tensor Flow image classification project, which is expected to run on machines with low processing power, and one of the concerns I was asked about was the time needed to train the model.
The project is still in the conception stage and no clear boundary is made.
But assuming that we will use Tensor flow for Python, with a simple Neural Network for say n images data set, is there a way to estimate or predict the time required to train the model before performing the training given the hardware in use?
I have asked one of my colleagues who works in NN and he said that maybe we could calculate the time needed by measuring the time for the first epoch and making an estimation how many epochs needed afterwards. Is this is a valid way? If yes then is it even possible to estimate the number of epochs needed? And either cases is there a way to calculate it before performing any training?
There is no definite way of finding the number of epochs to which the model converges. It is one of the hyperparameter.
Apart from the type of model you are training, convergence also depends on the distribution of data, and the optimizer you are using.
The rough estimate you can make by looking at the number of parameters you have in your model, check time for one epoch, and get a rough idea from "experience" on the number of epochs. BUT you always have to look at the training and validation loss curves to check for the convergence.

How to select batch size automatically to fit GPU?

I am training deep neural networks with a GPU. If I make samples too large, batches too large, or networks too deep, I get an out of memory error. In this case, it is sometimes possible to make smaller batches and still train.
Is it possible to calculate GPU size required for training and determine what batch size to choose beforehand?
UPDATE
If I print network summary, it displays number of "trainable parameters". Can't I estimate from this value? For example, take this, multiply by batch size, double for gradients etc?
PyTorch Lightning recently added a feature called "auto batch size", especially for this! It computes the max batch size that can fit into the memory of your GPU :)
More info can be found here.
Original PR: https://github.com/PyTorchLightning/pytorch-lightning/pull/1638
No, it is not possible to do this automatically. So you need to go through a lot of trial and error to find appropriate size if you want your batch to be as much as possible.
Stanford's CNN class provides some guidance how to estimate the memory size, but all suggestions are related to CNN (not sure what do you train).
I think Salvador here means that it is not possible to analytically compute the best suited batch size, however, as all things are in ML, it is just another hyperparameter, that can be added to your grid search to be computed automatically. Simply evaluate your model's loss or accuracy (however you measure performance) for the best and most stable (least variable) measure given several batch sizes, say some powers of 2, such as 64, 256, 1024, etc. Then keep use the best found batch size. Note that batch size can depend on your model's architecture, machine hardware, etc. For example, if you move your modeling from a local PC to some cloud compute engine (GCP, AWS, Azure,...), then the batch size which was too large for your PC's RAM becomes easily suitable for practically limitless RAM/CPU/GPU (mind the costs).

Selecting tensorflow object detection API training hyper parameters

I am setting up an object detection pipeline based on recently released tensorflow object detection API. I am using the arXiv as guidance. I am looking to understand the below for training on my own dataset.
It is not clear how they selected the learning rate schedules and how that would change based on the number of GPUs available for training. How do the training rate schedule change based on number of GPU's available for training? The paper mentions 9 GPUs are used. How should I change the training rate if I only want to use 1 GPU?
The released sample training config file for Pascal VOC using Faster R-CNN has initial learning rate = 0.0001. This is 10x lower than what was published in the original Faster-RCNN paper. Is this due to an assumption on the number of GPU's available for training or due to a different reason?
When I start training from the COCO detection checkpoint, how should the training loss decrease? Looking at tensorboard, on my dataset training loss is low - between 0.8 to 1.2 per iteration (with batch size of 1). Below image shows the various losses from tensorboard. . Is this expected behavior?
For questions 1 and 2: our implementation differs in a few small details compared to the original paper and internally we train all of our detectors with asynchronous SGD with ~10 GPUs. Our learning rates are calibrated for this setting (which you will also have if you decide to train via Cloud ML Engine as in the Pets walkthrough). If you use another setting, you will have to do a bit of hyperparameter exploration. For a single GPU, leaving the learning rate alone probably won't hurt performance, but you may be able to get faster convergence by increasing it.
For question 3: Training losses decrease erratically and you can only see the decrease if you smooth the plots quite a bit over time. Moreover, it's hard to explicitly say how well you are doing with respect to eval metrics just by looking at the training losses. I recommend looking at the mAP plots over time as well as the image visualizations to really get an idea of whether your model has "lifted off".
Hope this helps.

TensorFlow RNN training 100% CPU while only using 60% GPU

I'm working on code that trains a relatively large RNN (128 cell LSTM and some added layers). The main process is maxing out a core on the CPU, and I'm wondering if this is normal or whether I can optimize it. During the training loop (session.run calls) it's using about 60-70% GPU load while using 100% CPU load on one core. Note that data sampling work is already being done concurrently on other cores, so it's just the updating of the model parameters. Is this regular for such applications in TensorFlow or should the CPU load be much lower, while using the full capacity of the GPU?
We don't have full documentation on it yet, but you can take a look at the profiling information to see if it gives you more of an idea of where the time is going:
https://github.com/tensorflow/tensorflow/issues/1824#issuecomment-225754659
I think RNN cell have two input, it must wait for those two direction input when traning data, in other word, it optimize parallelism don't as easy as CNN. You can use a big batch size to improve the GPU utilization rate, but maybe cause other problem like that paper On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima.

What is the advantage of doing a Multi-GPU training in TensorFlow?

In this TensorFlow tutorial, you can use N number of GPUs to distribute N mini-batches (each containing M training samples) to each GPU and calculate the gradients concurrently.
Then you average the gradients collected from N GPUs and update the model parameters.
But this has the same effect as using a single GPU to calculate the gradients of N*M training samples, then updating the parameters.
So the only advantage seems to me is that you can use a larger-sized mini-batch in the same amount of time.
But is the larger-sized mini-batch necessarily better?
I thought you shouldn't use a large-sized mini-batch, in order to make the optimization more robust to saddle points.
If the larger-sized mini-batch is indeed not better, why would you care about Multi-GPU learning, or even Multi-server learning?
(The tutorial above is a synchronous training. If it was asynchronous training, then I can see the merit, since the parameters will be updated without averaging the gradients calculated by each GPU)
The main purpose for multi-GPU learning is to enable you train on large data set in shorter time. It is not necessarily better with larger mini-batch, but at least you can finish learning in a more feasible time.
More precisely, those N mini-batches are not trained in a synchronized way if you use Asynchronous SGD algorithm. As the algorithm changes when using multi-GPU, it is not equal to using MxN size mini-batch on single-GPU with SGD algorithm.
If you use sync multi-GPU training, the benefit is mainly time reduction. You could use M/N-size mini-match to maintain the effective mini-batch size, and of course the scalability is limited as smaller mini-batch size leads to more overhead. Data-exchange and synchronization on large number of computing nodes are also disasters.
Finally to solve the scalability issue, people move to A-SGD when using large number of GPUs concurrently. So probably you won't see someone using sync multi-GPU training on hundreds of (or even tens of) GPUs.
More gpu means more data in a batch. And the gradients of a batch data is averaged for back-propagation.
If the learning rate of a batch is fixed, then the learning rate of a data is smaller.
If the learning rate of a data is fixed, then the learning rate of a batch is larger.
https://github.com/guotong1988/BERT-GPU