How does tensorflow handle quantized networks

I have been reading about TensorFlow's conversion of neural networks from floats to 8-bit values. Reading the matrix multiplication code in their repository seems to indicate that they are using 8-bit integers rather than the fixed-point arithmetic their documentation might have suggested.
I want to understand how exactly it performs the transformation. From what I have read, I'm guessing that it scales the weights from 0 to 255. For instance, take a convolution on an input image whose values range from 0 to 255. The results of the convolution would then be 32-bit integers, which are scaled back to 0-255 using the min and max statistics of the output. Is that correct?
If so, why does this work?
Repository I checked for their code
https://github.com/google/gemmlowp/blob/master/standalone/neon-gemm-kernel-benchmark.cc#L573

I know I'm one year late to answer this question, but this answer may help someone else
Quantization
First, quantization is the process of converting a continuous range of values (float numbers) to a finite range of discrete values (quantized integers, qint). Quantized datatypes are pretty common in embedded systems, because most embedded systems have limited resources, and loading a trained network (which could be more than 200 MB) onto a microcontroller is unachievable. So we have to find a way to reduce the size of these trained networks.
Almost all of the size of a trained neural network is taken up by the weights. Because all of the weights are floating-point numbers, simple compression formats like zip don't compress them well. So we had to find another way, which is quantization.
How is it done?
Quantization is done by storing the minimum value and the maximum value for each layer's weights and then compressing each float value to an eight-bit integer representing the closest real number.
For example, assume that the weights of a certain layer in our neural network vary from -4.85 to 2.35, which are the min and max respectively. Quantization then maps each float to an 8-bit integer using the following formula:
quantized = round((value - min) / (max - min) * 255)
Then, for example, the value 1.3 becomes round((1.3 + 4.85) / 7.2 * 255) = 218, and 0 becomes round((0 + 4.85) / 7.2 * 255) = 172.
This simple formula shrinks the size by 75%, and as you can see, it is reversible: we can convert back to float after loading, so existing floating-point code works without any changes. Moving calculations over to eight bits also makes trained models run faster and use less power, which is essential on embedded systems and mobile devices.
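A minimal numpy sketch of this min/max scheme (just an illustration of the formula above, not TensorFlow's actual code):

import numpy as np

def quantize(x, x_min, x_max):
    # Compress floats in [x_min, x_max] to 8-bit integers.
    return np.round((x - x_min) / (x_max - x_min) * 255).astype(np.uint8)

def dequantize(q, x_min, x_max):
    # Recover an approximate float from the 8-bit code.
    return q.astype(np.float32) / 255 * (x_max - x_min) + x_min

w_min, w_max = -4.85, 2.35                              # per-layer min and max
weights = np.array([1.3, 0.0, -4.85, 2.35], dtype=np.float32)

q = quantize(weights, w_min, w_max)                     # -> [218, 172, 0, 255]
recovered = dequantize(q, w_min, w_max)                 # close to the originals
print(q, np.abs(recovered - weights).max())             # error is at most half a step (~0.014)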
Quantization Vs Precision
Won't that affect the precision of the model? Apparently its effect isn't that big, and in this article we can see why. In short, when we train a network the aim is to have it understand the patterns and discard the noise, which means we expect the network to produce good results despite a lot of noise. The networks that emerge from this process have to be numerically very robust, with a lot of redundancy in their calculations, so that small differences in input samples don't affect the results. That is what makes neural networks robust to noise, so we can treat the quantization error as just another kind of noise that a well-trained network can handle.

Related

Training Chip and Target Image format in TensorFlow

I am trying to build a Land Cover Classification model for Sentinel Image. The Image Channels(bands) I am using are 32-bit float.
I need to understand how to best format the image data, both the chips/patches for training and the target image for classification. I have a few questions:
Do I need to convert my original image and training chips from 32-bit to another depth?
Do I need to ensure that both the training chips/patches and the target have the same depth (32-bit, 16-bit, or other)?
Do I need to rescale my data? I saw some papers where data was rescaled to 0-1 or 0-255.
Does data depth affect the performance of learning and predicting?
Many thanks.
Maz
The best precision to use on a PC is float32, for several reasons: more precision makes calculations more accurate, which is better, but float16 is somehow slower than float32 on a PC (I don't remember why), and float64 is unusably slow on regular machines.
So
You usually need to use float32 as input anyway. So if it's float32 in the first place, just use it as is.
You do, but I think they all get converted to the desired precision during fit or predict in Keras. The default float type is set in $HOME/.keras/keras.json.
I don't think it's strictly required, but standardized (zero-centered) rescaling helps convergence; Google often simply rescales to -1 to 1 (see the sketch after this list).
It does, but as I said, more precision gives better accuracy but is slower.
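For the rescaling point, a small sketch (the array below is a placeholder for real Sentinel bands; the variable names are just for illustration):

import numpy as np

# Placeholder for a float32 Sentinel tile of shape (height, width, bands)
image = (np.random.rand(256, 256, 4) * 10000).astype(np.float32)

# Option 1: rescale each band to [-1, 1] from its own min/max
lo = image.min(axis=(0, 1), keepdims=True)
hi = image.max(axis=(0, 1), keepdims=True)
scaled = 2 * (image - lo) / (hi - lo) - 1

# Option 2: standardize each band (zero mean, unit variance),
# the "zero-centered" rescaling mentioned above
mean = image.mean(axis=(0, 1), keepdims=True)
std = image.std(axis=(0, 1), keepdims=True)
standardized = (image - mean) / (std + 1e-8)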

What is the difference between Floating point 16 and 8 bit quantized in Deep Learning Model

Currently, I am reading this website to understand face detection models. The article mentions a floating point 16 version and an 8-bit quantized version of the model.
I would like to ask:
What is the difference between the two of them?
What are the applications of the different types of DL model? In which cases do we need to use each?
Link to website:
https://www.learnopencv.com/face-detection-opencv-dlib-and-deep-learning-c-python/
What is the difference between the two of them?
As you can see, the 8-bit model is smaller than the FP16 one; FP16 and 8-bit here refer to the precision and data type used to store the model's weight values.
What are the applications of the different types of DL model? In which cases do we need to use each?
Normally, higher precision makes the model larger but possibly more accurate. However, when we need to run the model with low latency or on a lightweight device, we may want to reduce the model size, sometimes at a slight loss of accuracy.
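As an illustration with TensorFlow Lite (the saved-model path and input shape below are placeholders; the models in the linked article may have been produced differently), both variants come from post-training conversion of the same float model:

import numpy as np
import tensorflow as tf

# Float16 quantization: roughly 2x smaller, weights stored as FP16.
converter = tf.lite.TFLiteConverter.from_saved_model("my_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
fp16_model = converter.convert()

# 8-bit quantization: roughly 4x smaller; needs a few representative inputs
# so the converter can pick min/max ranges for the activations.
def representative_data():
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]  # replace with real images

converter = tf.lite.TFLiteConverter.from_saved_model("my_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
int8_model = converter.convert()

open("model_fp16.tflite", "wb").write(fp16_model)
open("model_int8.tflite", "wb").write(int8_model)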

Must each tensorflow batch contain a uniform distribution of the inputs for all expected classifications?

This is probably a newbie question but I'm trying to get my head around how training on small batches works.
Scenario -
For the MNIST classification problem, let's say that we have a model with appropriate hyperparameters that allow training on the digits 0-9. If we feed it small batches with a uniform distribution of inputs (more or less the same number of each digit in every batch), it will learn to classify as expected.
Now, imagine that instead of a uniform distribution, we trained the model on images containing only 1s, so that the weights are adjusted until it works perfectly for 1s. Then we start training on images that contain only 2s. Note that only the inputs have changed; the model and everything else has stayed the same.
Question -
What does training exclusively on 2s after the model was already trained exclusively on 1s do? Will it keep adjusting the weights until it has forgotten (so to speak) all about 1s and is now classifying only 2s? Or will it adjust the weights in a way that remembers both 1s and 2s?
In other words, must each batch contain a uniform distribution of different classifications? Does retraining a trained model in Tensorflow overwrite previous trainings? If yes, if it is not possible to create small (< 256) batches that are sufficiently uniform, does it make sense to train on very large (>= 500-2000) batch sizes?
That is a good question without a clear answer. In general, the order and selection of training samples have a large impact on the performance of the trained net, in particular with respect to the generalization properties it shows.
The impact is so strong, in fact, that selecting specific examples and ordering them in a particular way to maximize the performance of the net constitutes a genuine research area called "curriculum learning". See this research paper.
So back to your specific question: you should try different possibilities and evaluate each of them (which might be an interesting learning exercise anyway). I would expect uniformly distributed samples to generalize well across the different categories, and samples drawn from the original distribution to achieve the highest overall score (if 90% of your samples come from one category A, then 99% accuracy on A with 0% everywhere else still beats 70% across all categories in terms of total accuracy); other sample selection mechanisms will show different behavior.
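One way to run such an experiment (a sketch; the MNIST-style arrays x, y and the two batch generators are illustrative, not a library API):

import numpy as np

def balanced_batches(x, y, batch_size, num_classes=10, seed=0):
    # Yield batches containing roughly the same number of each class.
    rng = np.random.default_rng(seed)
    per_class = [np.flatnonzero(y == c) for c in range(num_classes)]
    k = batch_size // num_classes
    while True:
        idx = np.concatenate([rng.choice(ids, k, replace=False) for ids in per_class])
        rng.shuffle(idx)
        yield x[idx], y[idx]

def natural_batches(x, y, batch_size, seed=0):
    # Yield batches drawn from the data's original class distribution.
    rng = np.random.default_rng(seed)
    while True:
        idx = rng.choice(len(x), batch_size, replace=False)
        yield x[idx], y[idx]

Training the same model with each generator and comparing held-out accuracy gives a direct answer for your dataset.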
An interesting read about such questions is Bengio's 2012 paper, Practical Recommendations for Gradient-Based Training of Deep Architectures.
There is a section about online learning where the distribution of training data is unknown. I quote from the original paper
It means that online learners, when given a stream of non-repetitive training data, really optimize (maybe not in the optimal way, i.e., using a first-order gradient technique) what we really care about: generalization error.
The best practice, though, for figuring out how your dataset behaves under different testing scenarios is to try both approaches and get experimental results on how the distribution of the training data affects your generalization error.

Tensorflow: how to find good neural network architectures/hyperparameters?

I've been using tensorflow on and off for various things that I guess are considered rather easy these days. Captcha cracking, basic OCR, things I remember from my AI education at university. They are problems that are reasonably large and therefore don't really lend themselves to experimenting efficiently with different NN architectures.
As you probably know, Joel Grus came out with FizzBuzz in TensorFlow. TL;DR: learning a mapping from a binary representation of a number (i.e. 12 bits encoding the number) to 4 bits (none_of_the_others, divisible by 3, divisible by 5, divisible by 15). For this toy problem, you can quickly compare different networks.
So I've been trying a simple feedforward network and wrote a program to compare various architectures: a 2-hidden-layer feedforward network, then 3 layers, different activation functions, and so on. Most architectures, well, suck. They get somewhere near a 50-60% success rate and stay there, independent of how much training you do.
A few perform really well. For instance, a sigmoid-activated network with two hidden layers of 23 neurons each works really well (89-90% correct after 2000 training epochs). Unfortunately, anything close to it is disastrously bad: take one neuron out of the second layer and it drops to 30% correct, and the same happens when taking one out of the first layer. A single hidden layer of 20 tanh-activated neurons also does pretty well, but most variants reach little more than half this performance.
Now, given that for real problems I can't realistically do these sorts of studies of different architectures, are there ways to get good architectures that are guaranteed to work?
You might find the paper by Yoshua Bengio on Practical Recommendations for Gradient-Based Training of Deep Architectures helpful to learn more about hyperparameters and their settings.
If you're asking specifically for settings with more guaranteed success, I advise you to read up on Batch Normalization. I find that it decreases the failure rate for bad picks of the learning rate and weight initialization.
Some people also discourage the use of saturating non-linearities like sigmoid() and tanh(), as they suffer from the vanishing gradient problem.
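For instance, a hedged Keras sketch of the FizzBuzz setup with batch normalization and ReLU (the layer sizes and training settings are just an example, not a recommendation):

import numpy as np
import tensorflow as tf

def encode(i, bits=12):                       # 12-bit binary input, as in the question
    return [(i >> d) & 1 for d in range(bits)]

def label(i):                                 # 4 classes: other, /3, /5, /15
    return 3 if i % 15 == 0 else 2 if i % 5 == 0 else 1 if i % 3 == 0 else 0

x = np.array([encode(i) for i in range(1, 2 ** 12)], dtype=np.float32)
y = np.array([label(i) for i in range(1, 2 ** 12)])

model = tf.keras.Sequential([
    tf.keras.Input(shape=(12,)),
    tf.keras.layers.Dense(64),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.ReLU(),
    tf.keras.layers.Dense(64),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.ReLU(),
    tf.keras.layers.Dense(4, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x, y, epochs=100, batch_size=128, verbose=0)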

Neural network weights explode in linear unit

I am currently implementing a simple neural network and the backprop algorithm in Python with numpy. I have already tested my backprop method using central differences, and the resulting gradients match.
However, the network fails to approximate a simple sine curve. The network has one hidden layer (100 neurons) with tanh activation functions and an output layer with a linear activation function. Each unit also has a bias input. The training is done by simple gradient descent with a learning rate of 0.2.
The problem is that the gradient gets larger with every epoch, but I don't know why. Furthermore, the problem is unchanged if I decrease the learning rate.
EDIT: I have uploaded the code to pastebin: http://pastebin.com/R7tviZUJ
There are two things you can try, maybe in combination:
Use a smaller learning rate. If it is too high, you may be overshooting the minimum in the current direction by a lot, and so your weights will keep getting larger.
Use smaller initial weights. This is related to the first item. A smaller learning rate would fix this as well.
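A minimal numpy sketch of both suggestions together (smaller initial weights via 1/sqrt(fan-in) scaling plus a smaller learning rate); this is an illustration, not the code from the question:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
t = np.sin(x)

n_hidden, lr = 100, 0.05                                  # smaller learning rate
W1 = rng.normal(0, 1 / np.sqrt(1), (1, n_hidden))         # scale init by 1/sqrt(fan_in)
b1 = np.zeros(n_hidden)
W2 = rng.normal(0, 1 / np.sqrt(n_hidden), (n_hidden, 1))
b2 = np.zeros(1)

for epoch in range(5000):
    h = np.tanh(x @ W1 + b1)                              # hidden layer, tanh
    y = h @ W2 + b2                                       # linear output
    err = y - t
    gW2 = h.T @ err / len(x); gb2 = err.mean(0)           # backprop for mean squared error
    dh = (err @ W2.T) * (1 - h ** 2)
    gW1 = x.T @ dh / len(x); gb1 = dh.mean(0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

print(np.abs(y - t).max())                                # should shrink instead of exploding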
I had a similar problem (with a different library, DL4J), even in the case of extremely simple target functions. In my case, the issue turned out to be the cost function. When I changed from negative log likelihood to Poisson or L2, I started to get decent results. (And my results got MUCH better once I added exponential learning rate decay.)
It looks like you don't use regularization. If you train your network long enough, it will start to learn the exact data rather than the abstract pattern.
There are a couple of methods to regularize your network, such as early stopping, putting a high cost on large gradients, or more complex ones like dropout. If you search the web or books you will probably find many options.
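As a sketch of two simple options in the same numpy setting (the weight matrix and its gradient below are placeholders standing in for the values computed during backprop):

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(1, 100))            # stand-in for a weight matrix
grad_W = rng.normal(size=(1, 100))       # stand-in for its gradient from backprop
lam, lr = 1e-4, 0.05

# Option 1: gradient clipping - cap the gradient norm before the update,
# which also tames the exploding gradients described in the question
max_norm = 5.0
norm = np.linalg.norm(grad_W)
if norm > max_norm:
    grad_W *= max_norm / norm

# Option 2: L2 weight decay - penalize large weights by adding lam * W to the gradient
W -= lr * (grad_W + lam * W)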
A learning rate that is too big can fail to converge, and can even diverge; that is the point.
The gradient can diverge for this reason: when an update overshoots the position of the minimum, the resulting point may not just land a bit past it, it may end up on the other side at a greater distance than where it started. Repeat the process and it keeps diverging. In other words, the gradient (rate of variation) around the optimal position can simply be too big for the chosen learning rate.
Source: my understanding of the following video (watch near 7:30).
https://www.youtube.com/watch?v=Fn8qXpIcdnI&list=PLLH73N9cB21V_O2JqILVX557BST2cqJw4&index=10
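A tiny numeric illustration of that divergence, using gradient descent on f(x) = x^2 (gradient 2x, minimum at x = 0); the numbers are purely illustrative:

def descend(lr, x=1.0, steps=5):
    xs = [x]
    for _ in range(steps):
        x = x - lr * 2 * x      # gradient step on f(x) = x^2
        xs.append(x)
    return xs

print(descend(0.1))   # converges: 1.0, 0.8, 0.64, 0.512, ...
print(descend(1.1))   # diverges: each step lands on the other side, further out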