Use tf.layers.batch_normalization to preprocess inputs for SELU activation function? - tensorflow

The SELU activation function (https://github.com/bioinf-jku/SNNs/blob/master/selu.py) requires the input to be normalized to have a mean of 0.0 and a variance of 1.0. Therefore, I tried to apply tf.layers.batch_normalization (axis=-1) on the raw data to meet that requirement. The raw data in each batch have the shape [batch_size, 15], where 15 refers to the number of features. The graph below shows the variances of 5 of these features returned by tf.layers.batch_normalization (~20 epochs). They are not all close to 1.0 as expected, and the mean values are not all close to 0.0 either (graphs not shown).
How can I get all 15 features normalized independently, so that every feature after normalization has mean = 0 and variance = 1.0?

After reading the original papers of batch normalization (https://arxiv.org/abs/1502.03167) and SELU (https://arxiv.org/abs/1706.02515), I have a better understanding of them:
Batch normalization is an "isolation" procedure that ensures the input to the next layer (in any mini-batch) has a fixed distribution, thereby addressing the so-called internal covariate shift problem. The affine transform ( γ*x^ + β ) just tunes the standardized x^ to another fixed distribution for better expressiveness. For plain normalization, we need to set the center and scale parameters to False when calling tf.layers.batch_normalization.
Make sure the epsilon (still in tf.layers.batch_normalization) is set at least two orders of magnitude smaller than the smallest magnitude in the input data. The default value of epsilon is 0.001. In my case, some features have values as low as 1e-6, so I had to change epsilon to 1e-8.
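For reference, a minimal sketch of the call with these settings (TF 1.x API, as used in the question; x and is_training are hypothetical placeholders for the [batch_size, 15] input and the training flag):

import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, 15])
is_training = tf.placeholder(tf.bool)

x_normalized = tf.layers.batch_normalization(
    x,
    axis=-1,          # normalize each of the 15 features independently
    center=False,     # no learned beta (shift)
    scale=False,      # no learned gamma (scale)
    epsilon=1e-8,     # well below the smallest feature magnitudes (~1e-6)
    training=is_training)

# The moving mean/variance used at inference are only maintained if the
# update ops are run alongside the training op:
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)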
The inputs to SELU have to be normalized before feeding them into the model. tf.layers.batch_normalization is not designed for that purpose.
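One way to do that normalization outside the model is a plain per-feature standardization with statistics computed once on the training set (a NumPy sketch; the array names are illustrative):

import numpy as np

# train_features / test_features are assumed to be arrays of shape [N, 15].
train_mean = train_features.mean(axis=0)
train_std = train_features.std(axis=0) + 1e-8   # guard against zero variance

train_features = (train_features - train_mean) / train_std
test_features = (test_features - train_mean) / train_std   # reuse training statistics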

Related

What is the significance of normalization of data before feeding it to a ML/DL model?

I just started learning Deep Learning and was working with the Fashion MNIST data-set.
As part of pre-processing the inputs (the training and test images), dividing the pixel values by 255 is included as a normalization step:
training_images = training_images/255.0
test_images = test_images/255.0
I understand that this is to scale the values down to [0, 1] because neural networks handle such values more efficiently. However, if I skip these two lines, my model predicts something entirely different for a particular test_image.
Why does this happen?
Let's see both the scenarios with the below details.
1. With unnormalized data:
Since your network is tasked with learning how to combine inputs through a series of linear combinations and nonlinear activations, the parameters associated with each input will exist on different scales.
Unfortunately, this can lead to an awkward loss function topology which places more emphasis on certain parameter gradients.
Or, put simply, as Shubham Panchal mentioned in a comment:
If the images are not normalized, the input pixels will range over [0, 255]. These will produce huge activation values (if you're using ReLU). After the forward pass, you'll end up with huge loss values and gradients.
2. With Normalized data:
By normalizing our inputs to a standard scale, we're allowing the network to more quickly learn the optimal parameters for each input node.
Additionally, it's useful to ensure that our inputs are roughly in the range of -1 to 1 to avoid weird mathematical artifacts associated with floating-point number precision. In short, computers lose accuracy when performing math operations on really large or really small numbers. Moreover, if your inputs and target outputs are on a completely different scale than the typical -1 to 1 range, the default parameters for your neural network (i.e. learning rates) will likely be ill-suited for your data. In the case of images, dividing by 255 bounds the pixel intensities to [0, 1]; standardizing further (subtracting the mean and dividing by the standard deviation) would give roughly mean 0 and variance 1.
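For completeness, here is a sketch of the pre-processing step under discussion, assuming the standard tf.keras Fashion MNIST loader (variable names are illustrative):

import tensorflow as tf

(training_images, training_labels), (test_images, test_labels) = \
    tf.keras.datasets.fashion_mnist.load_data()

# Raw pixels are uint8 values in [0, 255]; dividing by 255 rescales them to [0, 1].
training_images = training_images / 255.0
test_images = test_images / 255.0

# Whatever scaling is applied at training time must also be applied at
# inference time; otherwise the model sees inputs ~255x larger than it was
# trained on and its predictions change drastically.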

dropout with relu activations

I am trying to implement a neural network with dropout in tensorflow.
tf.layers.dropout(inputs, rate, training)
From the documentation: "Dropout consists in randomly setting a fraction rate of input units to 0 at each update during training time, which helps prevent overfitting. The units that are kept are scaled by 1 / (1 - rate), so that their sum is unchanged at training time and inference time."
Now I understand this behavior if dropout is applied on top of sigmoid activations, which are strictly above zero. If half of the input units are zeroed, the sum of all the outputs will also be halved, so it makes sense to scale them by a factor of 2 in order to regain some kind of consistency before the next layer.
Now what if one uses the tanh activation which is centered around zero? The reasoning above no longer holds true so is it still valid to scale the output of dropout by the mentioned factor? Is there a way to prevent tensorflow dropout from scaling the outputs?
Thanks in advance
If you have a set of inputs to a node and a set of weights, their weighted sum is a value, S. You can define another random variable by selecting a random fraction f of the original variables. The weighted sum (using the same weights) of the variables selected this way has expected value S * f. From this, you can see that the argument for rescaling is exact if the objective is that the expected value of the sum remains the same with and without scaling. This would be true when the activation function is linear in the range of the weighted sums of subsets, and approximately true if it is approximately linear in that range.
After passing the linear combination through any non-linear activation function, it is no longer true that rescaling exactly preserves the expected mean. However, if the contribution to a node is not dominated by a small number of nodes, the variance in the sum of a randomly selected subset of a chosen, fairly large size will be relatively small, and if the activation function is approximately linear fairly near the output value, rescaling will work well to produce an output with approximately the same mean. Eg the logistic and tanh functions are approximately linear over any small region. Note that the range of the function is irrelevant, only the differences between its values.
With relu activation, if the original weighted sum is close enough to zero for the weighted sum of subsets to be on both sides of zero, a non-differentiable point in the activation function, rescaling won't work so well, but this is a relatively rare situation and limited to outputs that are small, so may not be a big problem.
The main observations here are that rescaling works best with large numbers of nodes making significant contributions, and relies on local approximate linearity of activation functions.
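A small numerical sketch of the rescaling argument above (plain NumPy, arbitrary illustrative sizes): dropping a random fraction of the inputs and scaling the survivors by 1 / (1 - rate) keeps the expected weighted sum approximately unchanged.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)      # inputs to a node
w = rng.normal(size=1000)      # the corresponding weights
rate = 0.5                     # dropout rate

s_full = np.dot(w, x)          # weighted sum without dropout

# Average the rescaled weighted sum over many random dropout masks.
sums = [np.dot(w, x * (rng.random(1000) >= rate)) / (1.0 - rate)
        for _ in range(10000)]

print(s_full, np.mean(sums))   # the two values should be close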
The point of setting a node's output to zero is so that the neuron has no effect on the neurons it feeds. This creates sparsity and hence attempts to reduce overfitting. When using sigmoid or tanh, the value is still set to zero.
I think your line of reasoning here is incorrect. Think of contribution rather than sum.

Choosing initial values for variables and parameters for optimizers in tensorflow

How do people typically choose initial values for their variables and parameters? Do we just tinker till it works?
I was following the Getting Started tutorial for tensorflow, and was able to train the linear model in it. However, I noticed that the starting values for the variables W, b were reasonably close to the ground truth.
When I change the data to make the ground truth values much further away, the gradient descent optimizer gives me NaN values for W, b.
However, in general, I don't think it would be reasonable to be able to guess the initial values of the variables in the model. Seems like I should be able to choose any arbitrary starting point and get to where I want.
I was thinking my choice of optimizer parameters (the learning rate) might be bad, but I am not sure how to adjust it. The default was 0.01; I've tried values from 0.001 to 100.
Would there be a discussion of optimization parameter choices and initial values for model variables in a general machine learning book? Really I am just looking for resources.
Thanks!
Some of the well-known initializers for convolutional neural networks:
Glorot Normal: Also called Xavier. Normal distribution centered on 0 with stddev = sqrt(2 / (fan_in + fan_out)) where fan_in is the number of input units in the weight tensor and fan_out is the number of output units in the weight tensor.
http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf
Lecun Uniform: Uniform distribution within [-limit, limit] where limit is sqrt(3 / fan_in) where fan_in is the number of input units in the weight tensor.
http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf
He Normal:
Truncated normal distribution centered on 0 with stddev = sqrt(2 / fan_in) where fan_in is the number of input units in the weight tensor.
http://arxiv.org/abs/1502.01852
Along with these initializers, one has to search over the learning rate, momentum and other hyperparameters.
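As a rough sketch of how these can be plugged in (assuming a TF 2.x-style tf.keras API; the layer sizes, activations and learning rate below are placeholders, not recommendations):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu",
                          kernel_initializer=tf.keras.initializers.he_normal()),
    tf.keras.layers.Dense(64, activation="tanh",
                          kernel_initializer=tf.keras.initializers.glorot_normal()),
    tf.keras.layers.Dense(1,
                          kernel_initializer=tf.keras.initializers.lecun_uniform()),
])

# The learning rate, momentum, etc. still need to be tuned on top of the
# choice of initializer.
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
              loss="mse")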

TensorFlow - Batch normalization failing on regression?

I'm using TensorFlow for a multi-target regression problem. Specifically, a convolutional network with pixel-wise labeling, where the input is an image and the label is a "heat-map" in which each pixel has a float value. More specifically, the ground truth label for each pixel is lower bounded by zero and, while technically having no upper bound, usually gets no larger than 1e-2.
Without batch normalization, the network is able to give a reasonable heat-map prediction. With batch normalization, the network takes much longer to get to a reasonable loss value, and the best it does is make every pixel the average value. This is using the tf.contrib.layers conv2d and batch_norm methods, with batch_norm being passed to conv2d's normalizer_fn (or not, in the case of no batch normalization). I had briefly tried batch normalization on another (single value) regression network, and had trouble then as well (though I hadn't tested that as extensively). Is there a problem using batch normalization on regression problems in general? Is there a common solution?
If not, what could be some causes of batch normalization failing on such an application? I've attempted a variety of initializations, learning rates, etc. I would expect the final layer (which of course does not use batch normalization) could use its weights to scale the output of the penultimate layer to the appropriate regression values. Failing that, I also removed batch norm from that layer, but with no improvement. I've attempted a small classification problem using batch normalization and saw no problem there, so it seems reasonable that it could somehow be due to the nature of the regression problem, but I don't know how that could cause such a drastic difference. Is batch normalization known to have trouble on regression problems?
I believe your issue is in the labels. Batch norm normalizes the activations it is applied to (roughly zero mean and unit variance before the learned affine transform). If the labels are not scaled to a similar range, the task becomes more difficult, because the network has to learn outputs on a very different scale.
By removing the batch norm from the penultimate layer, the task may improve slightly, but you are still requiring a single layer to learn to scale its normalized-range inputs down to the tiny label range, which works against your objective.
To solve this problem, apply a scaler to the labels (e.g. a min-max scaler to [0, 1]) so that their upper bound is no longer around 1e-2. During inference, transform the predictions back with the inverse of the same function to get the actual predictions.
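A minimal sketch of that label scaling (NumPy; train_labels, predictions_scaled, etc. are illustrative names):

import numpy as np

label_max = train_labels.max()          # e.g. somewhere around 1e-2

# Scale the labels into [0, 1] for training ...
train_labels_scaled = train_labels / label_max

# ... and invert the same transform on the network's outputs at inference
# time to recover predictions on the original scale.
predictions = predictions_scaled * label_max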

Multilabel image classification with sparse labels in TensorFlow?

I want to perform a multilabel image classification task for n classes.
I've got sparse label vectors for each image and each dimension of each label vector is currently encoded in this way:
1.0 -> label true / image belongs to this class
-1.0 -> label false / image does not belong to this class
0.0 -> missing value/label
E.g.: V = {1.0, -1.0, 1.0, 0.0}
For this example V the model should learn, that the corresponding image should be classified in the first and third class.
My problem is currently how to handle the missing values/labels. I've searched through the issues and found this issue:
tensorflow/skflow#113
So I could do multilabel image classification with:
tf.nn.sigmoid_cross_entropy_with_logits(logits, targets, name=None)
but TensorFlow only has this sparse loss function for softmax, which is used for mutually exclusive classification:
tf.nn.sparse_softmax_cross_entropy_with_logits(logits, labels, name=None)
So is there something like a sparse sigmoid cross entropy (I couldn't find one), or any suggestions on how I can handle my multilabel classification problem with sparse labels?
I used weighted_cross_entropy_with_logits as the loss function with positive weights for 1s.
In my case, all the labels are equally important, but 0 was ten times more likely than 1 to appear as the value of any given label.
So I weighted all the 1s using the pos_weight parameter of the aforementioned loss function, with a pos_weight (= weight on positive values) of 10. By the way, I cannot recommend any particular strategy for calculating pos_weight; I think it will depend entirely on the data at hand.
If the real label is 1:
weighted_cross_entropy = pos_weight * sigmoid_cross_entropy
Weighted cross entropy with logits is the same as sigmoid cross entropy with logits, except that the loss terms whose target is positive (i.e. 1) are multiplied by the extra pos_weight factor.
Theoretically, it should do the job. I am still tuning other parameters to optimize the performance. Will update with performance statistics later.
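A hedged sketch of the loss described above (keyword names follow the newer TF API; older TF 1.x versions called the first argument targets; labels and logits are assumed float tensors of shape [batch_size, n_classes] with labels in {0, 1}):

import tensorflow as tf

per_label_loss = tf.nn.weighted_cross_entropy_with_logits(
    labels=labels,       # 0/1 targets
    logits=logits,       # raw network outputs, no sigmoid applied here
    pos_weight=10.0)     # up-weights the loss terms where the label is 1

loss = tf.reduce_mean(per_label_loss)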
First, I would like to know what you mean by missing data. What is the difference between missing and false in your case?
Next, I think it is a mistake to represent your data like this: you are trying to represent unrelated information on the same dimension. (If it were only false or true, it would work.)
What seems better to me is to represent, for each class, a probability for each of the three states: true, missing, or false.
In your case V = [(1,0,0),(0,0,1),(1,0,0),(0,1,0)]
Ok!
So your problem is more about how to handle the missing data I think.
So I think you should definitely use tf.nn.sigmoid_cross_entropy_with_logits().
Just change the target for the missing data to 0.5 (0 for false and 1 for true).
I never tried this approach but it should let your network learn without biasing it too much.
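A minimal sketch of that suggestion, assuming raw_labels is a float tensor in {-1, 0, 1} and logits has the same [batch_size, n_classes] shape:

import tensorflow as tf

# Map -1 -> 0.0 (false), 0 -> 0.5 (missing), 1 -> 1.0 (true).
targets = (raw_labels + 1.0) / 2.0

loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=targets, logits=logits))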