How to use the quantized Tensorflow MobileNet v1 floating point scaling values - tensorflow

There are quantized MobileNet v1 models available at https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet_v1.md
I see floating-point scaling values associated with the weights and biases in the model, but it isn't evident how they should be used to scale the operations.
The gemmlowp quantization documentation describes scaling values associated with the input, the weights, and the down-scaling of the operation's accumulator.
Should the bias scaling value be used alone for down-scaling the accumulator, or is the weight scaling value required?
In short, I'm trying to determine how the two provided scaling values should be used.
Thanks.
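For reference, here is a rough sketch of how the scales fit together in gemmlowp-style 8-bit inference. It assumes the convention used by TensorFlow Lite, where the int32 bias is stored with a scale equal to input scale times weight scale, so it can be added directly to the accumulator. All numeric values (scales, zero points, inputs) are made up for illustration and are not taken from the MobileNet checkpoints.

```python
# Sketch of gemmlowp-style scaling for one quantized dot product.
# Every scale and zero point below is an illustrative assumption.

def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Map a float to an 8-bit integer level."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))

input_scale, input_zp = 0.02, 128
weight_scale, weight_zp = 0.005, 120
output_scale, output_zp = 0.05, 100

# The int32 accumulator has scale input_scale * weight_scale and zero
# point 0. The bias is assumed to be stored with that same scale.
bias_scale = input_scale * weight_scale

x = [0.5, -1.0, 0.26]
w = [0.3, 0.1, -0.4]
b = 0.2

xq = [quantize(v, input_scale, input_zp) for v in x]
wq = [quantize(v, weight_scale, weight_zp) for v in w]
bq = round(b / bias_scale)  # int32 bias, not clamped to 8 bits

# Integer accumulation with zero points removed, then the bias is added.
acc = sum((a - input_zp) * (c - weight_zp) for a, c in zip(xq, wq)) + bq

# Down-scaling the accumulator needs the input, weight AND output scales.
multiplier = input_scale * weight_scale / output_scale
yq = max(0, min(255, round(acc * multiplier) + output_zp))

# Compare against the plain floating-point result.
y_float = sum(a * c for a, c in zip(x, w)) + b
y_dequant = (yq - output_zp) * output_scale
```

Under this assumption the bias scale alone is not enough to down-scale the accumulator: it equals the accumulator's scale, and the output scale of the layer is still needed to form the multiplier.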

Related

Keras Masking layer for LSTM input to mask features instead of timesteps

I gather that Masking layers in Keras are commonly used for handling data inputs with varying timesteps. Based on the documentation, I understand that if all of the features for a given timestep equal the mask value, then that timestep will be skipped in downstream layers.
For my problem, I am instead interested in using masking for features, where the data input shape to the network is (batch_size, num_timesteps, num_features). Essentially, I want to be able to predict a timeseries one step into the future with num_features features, but assuming that I won't always have all the features from the previous timestep to base my prediction on.
For example, one could predict RGB values one timestep into the future for a pixel in a video stream based on partial data from a previous timestep. At every timestep the output should be all RGB, but some timesteps you may get only RG, or only RB, or only BG, but you never know which partial data you'll have at each timestep to make your prediction. This is why I want to somehow be able to indicate a feature as masked during training to accommodate this kind of prediction.
It may be that Masking in Keras is not the correct mechanism to achieve this. What is the correct type of network layer that would give me this behavior?

What's the relationship between Tensorflow's dataflow graph and DNN?

As we know, a DNN is composed of many layers, each consisting of many neurons that apply the same function to different parts of the input. Meanwhile, if we use TensorFlow to execute a DNN task, TensorFlow automatically generates a dataflow graph, which we can visualize with TensorBoard. But there are no neurons in the layers of that graph. So I wonder: what is the relationship between the TensorFlow dataflow graph and a DNN? When a neuron in a DNN layer is mapped into the dataflow graph, how is it represented? What is the relationship between a neuron in a DNN and a node in TensorFlow (which represents an operation)? I have just started to learn about DNNs and TensorFlow, so please help me put my thoughts in order. Thanks :)
You have to differentiate between the metaphorical representation of a DNN and its mathematical description. The math behind a classic neuron is the sum of the weighted inputs plus a bias (usually followed by an activation function applied to this result).
So in this case you have an input vector multiplied by a weight vector (containing trainable variables) and then summed with a bias scalar (also trainable).
If you now consider a layer of neurons instead of a single one, the weights become a matrix and the bias a vector. So calculating a feed-forward layer is nothing more than a matrix multiplication followed by a vector addition.
This is the operation you can see in your TensorFlow graph.
You can actually build your neural network this way without any use of the so-called high-level API, which uses the layer abstraction. (Many did this in the early days of TensorFlow.)
The actual "magic" that TensorFlow does for you is calculating and executing the derivatives of this forward pass in order to compute the updates for the weights.
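To make the "layer equals matrix multiplication plus bias" point concrete, here is a minimal sketch in plain Python, with no TensorFlow involved. The function name `dense` and the example numbers are illustrative assumptions; in a TensorFlow graph the same computation would typically appear as a matrix-multiply node, an add node, and an activation node.

```python
def dense(inputs, weights, biases, activation=None):
    """One feed-forward layer: inputs @ weights + biases.

    inputs:  list of length n_in
    weights: n_in x n_out matrix, given as a list of rows
    biases:  list of length n_out
    """
    outputs = []
    for j in range(len(biases)):
        # Column j of the weight matrix plays the role of "neuron" j:
        # a weighted sum of all inputs plus that neuron's bias.
        z = sum(inputs[i] * weights[i][j] for i in range(len(inputs)))
        z += biases[j]
        outputs.append(activation(z) if activation else z)
    return outputs

def relu(z):
    return max(0.0, z)

x = [1.0, 2.0]                 # two input features
W = [[0.5, -1.0, 0.0],         # 2 x 3 weight matrix: three "neurons"
     [0.25, 0.5, -1.0]]
b = [0.0, 0.5, 1.0]            # one bias per neuron

y = dense(x, W, b, activation=relu)
print(y)  # [1.0, 0.5, 0.0]
```

No individual neuron object exists anywhere in this code; each neuron is just one column of `W` and one entry of `b`, which is why the graph shows operations rather than neurons.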

DropoutWrapper in tensorflow and test time scaling

I learnt in the CS231n class that at test time we need to scale the activations by the same probability we used for dropout during training. When using the DropoutWrapper in TensorFlow I don't see any parameter that would allow me to do this test-time scaling.
Why is it missing? Is it necessary for RNNs? What is the right way to do it?
You don't need scaling at inference, because tf.nn.dropout applies the scaling at training time (from the tf.nn.dropout documentation):
With probability keep_prob, outputs the input element scaled up by 1 / keep_prob, otherwise outputs 0. The scaling is so that the expected sum is unchanged.
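The effect of that train-time scaling can be checked with a small simulation. This is a sketch in plain Python rather than a call to tf.nn.dropout; the helper name `inverted_dropout` and the chosen numbers are assumptions made for illustration.

```python
import random

def inverted_dropout(values, keep_prob, rng):
    """Training-time dropout: keep each element with probability
    keep_prob and scale the kept ones up by 1 / keep_prob."""
    return [v / keep_prob if rng.random() < keep_prob else 0.0
            for v in values]

rng = random.Random(0)          # fixed seed for repeatability
keep_prob = 0.8
activations = [1.0] * 100_000

# Training: some units are dropped, the rest are scaled up.
dropped = inverted_dropout(activations, keep_prob, rng)
train_mean = sum(dropped) / len(dropped)

# Inference: no dropout and no extra scaling.
test_mean = sum(activations) / len(activations)
```

Here `train_mean` comes out close to `test_mean`, which is why the activations need no further rescaling at test time when the scaling has already been applied during training.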

Training quantized models in TensorFlow

I would like to train a quantized network, i.e. use quantized weights during the forward pass to calculate the loss and then update the underlying full-precision floating point weights during the backward pass.
Note that in my case "fake quantization" is sufficient. That means that the weights can still be stored as 32-bit floating point values, as long as they represent a low bitwidth quantized value.
In a blog post from Pete Warden he states:
[...] we do have support for “fake quantization” operators. If you include these in your graphs at the points where quantization is expected to occur (for example after convolutions), then in the forward pass the float values will be rounded to the specified number of levels (typically 256) to simulate the effects of quantization.
The mentioned operators can be found in the TensorFlow API.
Can anybody point out to me how to use these functions?
If I call them after e.g. a conv layer in my model definition, why would this quantize the weights in the layer instead of the outputs (activations) of this layer?
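As a rough illustration of what a fake-quantization step does numerically, here is a plain-Python sketch. It is not the TensorFlow implementation: the function name `fake_quant` is hypothetical, and the real ops additionally adjust the range so that zero is exactly representable, which this sketch omits. The defaults (a range of -6 to 6 and 8 bits) are chosen for the example.

```python
def fake_quant(x, min_val=-6.0, max_val=6.0, num_bits=8):
    """Clamp x to [min_val, max_val], snap it to one of 2**num_bits
    evenly spaced levels, and return the result as a float."""
    levels = 2 ** num_bits - 1
    scale = (max_val - min_val) / levels
    clamped = max(min_val, min(max_val, x))
    q = round((clamped - min_val) / scale)   # integer level 0..levels
    return min_val + q * scale               # float again, but quantized

weights = [-7.0, -0.013, 0.0, 0.5, 3.14159, 9.0]
quantized = [fake_quant(w) for w in weights]
```

The function quantizes whatever values it is given. Placed after a convolution it would act on that layer's activations; to simulate quantized weights it would have to be applied to the weight values themselves before they enter the convolution.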

patch-wise training and fully convolutional training in FCN

In the FCN paper, the authors discuss the patch wise training and fully convolutional training. What is the difference between these two?
Please refer to section 4.4 attached in the following.
It seems to me that the training mechanism is as follows:
Assume the original image is M*M; we then iterate over the M*M pixels to extract N*N patches (where N<M). The iteration stride can be some number like N/3 to generate overlapping patches. Moreover, assume each image yields 20 patches; then we can put these 20 patches, or 60 patches (if we want to use 3 images), into a single mini-batch for training. Is this understanding right? It seems to me that this so-called fully convolutional training is the same as patch-wise training.
The term "Fully Convolutional Training" just means replacing the fully-connected layers with convolutional layers so that the whole network contains only convolutional layers (and pooling layers).
The term "Patchwise training" is intended to avoid the redundancies of full image training.
In semantic segmentation, given that you are classifying each pixel in the image, using the whole image adds a lot of redundancy to the input. A standard approach to avoid this when training segmentation networks is to feed the network batches of random patches (small image regions surrounding the objects of interest) from the training set instead of full images. This "patchwise sampling" ensures that the input has enough variance and is a valid representation of the training dataset (the mini-batch should have the same distribution as the training set). This technique also helps training converge faster and helps balance the classes. In this paper, they claim that it is not necessary to use patch-wise training, and that if you want to balance the classes you can weight or sample the loss.
From a different perspective, the problem with full-image training in per-pixel segmentation is that the input image has a lot of spatial correlation. To fix this, you can either sample patches from the training set (patchwise training) or sample the loss from the whole image. That is why the subsection is called "Patchwise training is loss sampling".
So "restricting the loss to a randomly sampled subset of its spatial terms excludes patches from the gradient computation." They tried this "loss sampling" by randomly ignoring cells from the last layer, so that the loss is not calculated over the whole image.
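The loss-sampling idea can be sketched as follows. This is a plain-Python illustration under assumed names (`sampled_loss`, `keep_prob`) and a made-up 2x2 loss map; it is not the procedure from the paper's code, only the general technique of averaging the per-pixel loss over a random subset of spatial positions.

```python
import random

def sampled_loss(per_pixel_loss, keep_prob, rng):
    """Average the loss over a random subset of spatial positions.

    per_pixel_loss: 2-D list with one loss value per output cell.
    Cells that are not sampled contribute nothing to the loss, and
    therefore nothing to the gradient.
    """
    kept = [loss for row in per_pixel_loss for loss in row
            if rng.random() < keep_prob]
    return sum(kept) / len(kept) if kept else 0.0

rng = random.Random(0)
loss_map = [[0.2, 0.4],
            [0.6, 0.8]]

full = sampled_loss(loss_map, 1.0, rng)     # every cell kept
partial = sampled_loss(loss_map, 0.5, rng)  # roughly half the cells kept
```

With `keep_prob` set to 1.0 this reduces to the ordinary full-image loss; lowering it has a similar effect to training on patches, since only the receptive fields of the sampled cells take part in each update.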