Mixed Precision Training using JAX - tensorflow

I'm trying to understand how Haiku achieved a 2x speedup when training ResNet-50 on ImageNet (https://github.com/deepmind/dm-haiku/tree/main/examples/imagenet) using DeepMind's JMP library (https://github.com/deepmind/jmp), and how to replicate this with other networks.
1- If we compare the time needed for a matrix multiplication in float32 and float16 on a GPU, we barely see a 5% speedup. How can this become a 2x speedup as we scale the number of operations?
2- Why do we need to apply mixed precision to the network as well? If your data and parameters are in float16, aren't all ops inside the neural network in float16 too?
3- Can I hope to see any speedup with a small fully connected network? A deep fully connected network? Or only with big vision networks optimized specifically for that?
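
For concreteness, here is a minimal sketch of the kind of policy usage the JMP README describes (the toy forward pass and the parameter names `w` and `b` are illustrative, not the actual ResNet-50 example):

```python
# Minimal sketch of a JMP mixed-precision policy (based on the JMP README;
# the toy forward pass and parameter names are illustrative).
import jax.numpy as jnp
import jmp

# Keep parameters in float32, run the compute in float16, return float32 outputs.
policy = jmp.get_policy("p=f32,c=f16,o=f32")

def forward(params, x):
    params = policy.cast_to_compute(params)  # f32 -> f16 before the matmul
    x = policy.cast_to_compute(x)
    y = x @ params["w"] + params["b"]        # runs in float16 (tensor cores on GPU)
    return policy.cast_to_output(y)          # back to float32 for the loss

params = {"w": jnp.ones((128, 64)), "b": jnp.zeros((64,))}
print(forward(params, jnp.ones((8, 128))).dtype)  # float32
```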

Related

YoloV3 deployment on JETSON TX2

I am facing a problem with YOLO object detection deployment on the TX2.
I use a pre-trained YOLOv3 (trained on the COCO dataset) to detect a limited set of objects (I am mostly concerned with five classes, not all of them). The speed is low for real-time detection, and the accuracy is not perfect (but acceptable) on my laptop. I'm thinking of making it faster with multithreading or multiprocessing on my laptop; is that possible for YOLO?
But my main problem is that the algorithm does not run on the Raspberry Pi or the NVIDIA TX2.
Here are my questions:
1- In general, is it possible to run YOLOv3 on the TX2 without any modification such as accelerators or model compression techniques?
2- I cannot run the model on the TX2. First I got an error regarding the camera, so I decided to run the model on a video; this time I got the 'cannot allocate memory in static TLS block' error. What is the reason for this error? The model is too big: it uses 16 GB of GPU memory on my laptop, and the GPU memory of the Raspberry Pi and the TX2 is less than 8 GB. As far as I know the solutions are using a smaller model, using TensorRT, or pruning. Do you have any idea if there is any other way?
3- If I use tiny-YOLO I will get lower accuracy, and that is not what I want. Is there any way to run an object detection model with high real-time performance, in terms of both accuracy and speed (FPS), on a Raspberry Pi or an NVIDIA TX2?
4- If I trim the COCO data down to just the objects I care about and then train the same model, I would get higher accuracy and speed, but the size would not change. Am I correct?
5- In general, what is the best model in terms of accuracy for real-time detection, and what is the best in terms of speed?
6- How is MobileNet? Is it better than the YOLOs in terms of both accuracy and speed?
1- Yes, it is possible. I have already run YOLOv3 on a Jetson Nano.
2- It depends on the model and the input resolution of the data. You can decrease the input resolution. Input images are transferred to GPU VRAM for use by the model, so large input sizes can allocate a lot of memory. As far as I remember, I ran the normal YOLOv3 on a Jetson Nano (which is weaker than the TX2) two years ago. Also, you can use YOLOv3-tiny and TensorRT, as you mention. There are many sources on the web, like this & this.
3- I suggest you have a look here. With this repo, you can do transfer learning with your dataset, optimize the model with TensorRT, and run it on the Jetson.
4- The size does not depend on the dataset; it depends on the model architecture (because the size comes from the weights). The speed probably does not change. The accuracy depends on your dataset; it can be better or worse. If any class in COCO is similar to any class in your dataset, I suggest transfer learning.
5- You have to find the right model with small size, sufficient accuracy, and decent speed. There is no single best model; there is a best model for your case, and it also depends on your dataset. You can compare the accuracy and FPS of some of the models here.
6- Most people use MobileNet as a feature extractor. Read this paper: you will see that YOLOv3 has better accuracy, while SSD with a MobileNet backbone has better FPS. I suggest you use the jetson-inference repo.
By using the jetson-inference repo, I get sufficient accuracy with an SSD model and 30 FPS. Also, I suggest using a MIPI-CSI camera on the Jetson; it is faster than USB cameras.
I fixed problems 1 and 2 just by swapping the import order of OpenCV and TensorFlow inside the script (see the sketch below). Now I can run YOLOv3 without any modification on the TX2, and I get an average of 3 FPS.
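
For anyone hitting the same error, roughly what the workaround looks like; the order that works can differ per setup, but on Jetson boards loading OpenCV before TensorFlow is commonly reported:

```python
# Sketch of the import-order workaround for the
# 'cannot allocate memory in static TLS block' error: load cv2 before
# tensorflow (swap the order if your setup needs the opposite).
import cv2               # OpenCV first
import tensorflow as tf  # TensorFlow second

print(cv2.__version__, tf.__version__)
```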

Is it possible to significantly reduce the inference time of images by reducing the number of object classes?

I am using YOLOv4 to train my custom detector. Source: https://github.com/AlexeyAB/darknet
One of the issues while training is the computing power of the GPU and the available video RAM. What is the relationship between the number of object classes and the time it takes to train the model? Also, is it possible to significantly reduce the inference time for images by reducing the number of object classes? The goal is to run inference on a Raspberry Pi or a Jetson Nano.
Any help is much appreciated. Thanks.
A change in the number of classes doesn't have a significant impact on inference time.
For example, in the case of YOLOv4, which has 3 YOLO layers, a change in the number of classes changes the filter count of the conv layers preceding the YOLO layers and slightly reduces the computation inside the YOLO layers, and that's all. This is minute compared to the overall inference time, because the conv layers preceding the YOLO layers sit at the bottom of the network with very small width and height, and the time spent on the class-dependent logic inside a YOLO layer is also very small.
Here:
filters=(classes + 5)x3
Note that the tinier version of YOLOv4, i.e. tiny-YOLOv4, has only two YOLO layers instead of 3.
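
As a quick sanity check, a worked example of that formula (80 classes is the standard COCO model, 5 classes is a custom detector like the one above):

```python
# Worked example of the darknet head-size formula: each of the 3 anchors per
# YOLO layer predicts 4 box coordinates + 1 objectness score + 1 score per class.
def yolo_conv_filters(num_classes, anchors_per_layer=3):
    return (num_classes + 5) * anchors_per_layer

print(yolo_conv_filters(80))  # 255 filters for the 80-class COCO model
print(yolo_conv_filters(5))   # 30 filters for a 5-class custom model
```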
If your intent is to reduce inference time, especially on a Raspberry Pi or a Jetson Nano, without losing accuracy/mAP, do the following:
Quantization: run inference with INT8 instead of FP32. You can use this repo for this purpose. You can do this for both the Jetson Nano and the Raspberry Pi.
Use an inference library such as tkDNN, which is a deep neural network library built with cuDNN and TensorRT primitives, designed specifically to work on NVIDIA Jetson boards. You can use this for the Jetson Nano. Note that with TensorRT you can use INT8 or FP16 instead of FP32 to reduce detection time; a sketch follows these two points.
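
As a sketch of the TensorRT route (the ONNX file name is an assumption, the API calls follow TensorRT's Python bindings, and INT8 would additionally need a calibrator attached to the builder config):

```python
# Hedged sketch: build an FP16 TensorRT engine from an assumed ONNX export
# of the detector, then save it for deployment on the Jetson.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("yolov4.onnx", "rb") as f:        # assumed ONNX export of the model
    if not parser.parse(f.read()):
        raise RuntimeError("ONNX parse failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)       # FP16 instead of FP32
serialized = builder.build_serialized_network(network, config)

with open("yolov4_fp16.engine", "wb") as f:
    f.write(serialized)                     # load at runtime with trt.Runtime
```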
The following techniques can also reduce inference time, but they come at the cost of a significant drop in accuracy/mAP:
You can train tinier versions of the models rather than the full YOLO versions.
Model pruning - if you can rank the neurons in the network by how much they contribute, you can then remove the low-ranking neurons, resulting in a smaller and faster network (a minimal sketch follows this list). See the pruned YOLOv3 research paper and its implementation; this is another pruned YOLOv3 implementation.
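
To make the pruning idea concrete, here is a minimal sketch using PyTorch's built-in pruning utilities; the cited papers prune whole channels and fine-tune afterwards, which is more involved, and the layer shape below is just an example:

```python
# Minimal magnitude-pruning sketch with torch.nn.utils.prune. This only shows
# the "drop the lowest-ranking weights" idea on an example 1x1 conv layer.
import torch
import torch.nn.utils.prune as prune

conv = torch.nn.Conv2d(256, 255, kernel_size=1)         # example head-sized conv
prune.l1_unstructured(conv, name="weight", amount=0.3)  # zero the 30% smallest weights
prune.remove(conv, "weight")                            # bake the mask into the weights
print(float((conv.weight == 0).float().mean()))         # ~0.3 of the weights are now zero
```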
I tried reducing the number of classes from 80 to 5 (I was aiming to detect vehicles only) on YOLOv3 and found a reduction in time. For example, on an Intel Core i5-6300U CPU @ 2.40 GHz the time was reduced by 50%, and on an NVIDIA GeForce 930M it was reduced by 20%. Generally, the stronger the processor, the smaller the reduction in time.

Large input image limitations for VGG19 transfer learning

I'm using TensorFlow (via the Keras API) in Python 3. I'm using the VGG19 pre-trained network to perform style transfer on an NVIDIA RTX 2070.
The largest input image that I have is 4500x4500 pixels (I have removed the fully connected layers in VGG19 to allow for a fully convolutional network that handles arbitrary image sizes). If it helps, my batch size is just 1 image at a time currently.
1.) Is there an option for parallelizing the evaluation of the model on the image input given that I am not training the model, but just passing data through the pre-trained model?
2.) Is there any increase in capacity for handling larger images in going from 1 GPU to 2 GPUs? Is there a way for the memory to be shared across the GPUs?
I'm unsure whether larger images make my GPU compute-bound or memory-bound. I'm speculating that it's a compute issue, which is what started my search for discussions of parallel CNN evaluation. I've seen some papers on tiling methods that seem to allow for larger images.
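
For reference, a sketch of the setup being described, with assumed sizes; at 4500x4500 the activations of the early conv layers alone are several gigabytes, so an 8 GB RTX 2070 may well run out of memory:

```python
# Sketch of a fully-convolutional VGG19 evaluated (not trained) on one large
# image. The 4500x4500 random input is only illustrative.
import numpy as np
import tensorflow as tf

model = tf.keras.applications.VGG19(include_top=False,           # drop the dense head
                                    weights="imagenet",
                                    input_shape=(None, None, 3))  # arbitrary spatial size
image = np.random.rand(1, 4500, 4500, 3).astype("float32")
features = model(image, training=False)                          # inference only
print(features.shape)
```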

Suggest some useful techniques to reduce the size of a CNN architecture?

Context: I am going to start training a CNN to classify a dataset. This CNN will have to be deployed for a real-world application, so a forward pass through it has to be fast. Most of the CNN architectures I have read about cannot run without a GPU and need a lot of costly resources to be deployed.
Question:
Now, I know one particular technique that's quite useful for reducing the size of a CNN architecture: downsize the image using cubic interpolation (cubic interpolation helps improve certain image features like edges). This reduces the number of convolution layers as well as the filter sizes, and thus the overall number of parameters in the CNN, by quite a lot. I wanted to know if there are other techniques that can make a CNN smaller so that it can realistically be deployed.
Binarization techniques are effective algorithms that constrain both the parameters and the activations of a network to binary values. Obviously the precision loss may degrade the final performance a bit, but the binary representation greatly reduces the resource requirements of the network.
For instance, you can have a look at these works:
Binarized Neural Networks
Binarized Neural Networks: Training Neural Networks with Weights and Activations Constrained to +1 or −1
XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks
which released their code.
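
To make the idea concrete, here is a toy sketch of the core trick those papers share: binarize the weights with sign() in the forward pass while letting gradients pass through unchanged (a straight-through estimator). The real methods add per-filter scaling factors, clipping, and binarized activations on top of this.

```python
# Toy sketch of weight binarization with a straight-through estimator.
import torch

class BinarizeSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w):
        return torch.sign(w)      # +1 / -1 weights in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output        # gradients pass straight through

w = torch.randn(4, 4, requires_grad=True)
w_bin = BinarizeSTE.apply(w)      # use w_bin in place of w inside a layer
w_bin.sum().backward()
print(w.grad)                     # the real-valued weights still receive gradients
```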

Training Resnet deep neural network from scratch

I need to gain some knowledge about deep neural networks.
For a very deep neural network like ResNet, we can use transfer learning to train a model.
ResNet has been trained on the ImageNet dataset, so its pre-trained weights can be used to train a model on another dataset (for example, training a model for lung cancer detection with CT lung images).
I feel that this approach will not be accurate, as the pre-trained weights were trained entirely on other objects, not on medical data.
Instead of transfer learning, is it possible to train the ResNet from scratch? (The available number of images to train it is around 1500.) Is it something that can be done on a normal computer?
Can someone please share your valuable ideas with me?
Is it possible to train the ResNet from scratch?
Yes, it is possible, but the amount of time needed to reach good accuracy greatly depends on the data. For instance, training the original ResNet-50 on an NVIDIA M40 GPU took 14 days (about 10^18 single-precision ops). The most expensive operations in a CNN are the convolutions in the early layers.
ImageNet contains 14M 226x226x3 images. Since your dataset is ~10,000x smaller, each epoch will take ~10,000x fewer ops. On top of that, if you pass grayscale instead of RGB images, the first convolution will take 3x fewer ops. Likewise, the spatial image size affects the training time as well. Training on smaller images also allows a larger batch size, which usually speeds things up thanks to vectorization.
All in all, I estimate that a machine with a single consumer GPU, such as a GTX 1080 or 1080 Ti, can train ~100 epochs of the ResNet-50 model in a day. Obviously, training on a 2-GPU machine would be even faster. If that is what you mean by a normal computer, the answer is yes.
But since your dataset is very small, there's a big chance of overfitting. This looks like the biggest issue that your approach faces.
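
If you do decide to try it, here is a minimal from-scratch sketch in Keras; the directory layout, grayscale input, image size, and two-class setup are all assumptions, and with only ~1500 images you would want heavy augmentation and regularization on top of this:

```python
# Hedged sketch: ResNet-50 trained from scratch (weights=None) on an assumed
# two-class CT dataset laid out as ct_images/train/<class_name>/*.png.
import tensorflow as tf

model = tf.keras.applications.ResNet50(weights=None,               # no ImageNet weights
                                       input_shape=(224, 224, 1),  # grayscale slices
                                       classes=2)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

train_ds = tf.keras.utils.image_dataset_from_directory(
    "ct_images/train", color_mode="grayscale",
    image_size=(224, 224), batch_size=32)                          # assumed layout
model.fit(train_ds, epochs=100)
```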