What is the significance of normalization of data before feeding it to a ML/DL model? - tensorflow

I just started learning Deep Learning and was working with the Fashion MNIST data-set.
As part of pre-processing the inputs (the training and test images), dividing the pixel values by 255 is included as a normalization step.
training_images = training_images/255.0
test_images = test_images/255.0
I understand that this is to scale the values down to [0, 1] because neural networks handle values in that range more efficiently. However, if I skip these two lines, my model predicts something entirely different for a particular test_image.
Why does this happen?

Let's see both the scenarios with the below details.
1. With Unnormalized data:
Since your network is tasked with learning how to combine inputs through a series of linear combinations and nonlinear activations, the parameters associated with each input will exist on different scales.
Unfortunately, this can lead toward an awkward loss function topology which places more emphasis on certain parameter gradients.
Or, put simply, as Shubham Panchal mentioned in a comment:
If the images are not normalized, the input pixels will range from [0, 255]. These will produce huge activation values (if you're using ReLU). After the forward pass, you'll end up with a huge loss value and huge gradients.
2. With Normalized data:
By normalizing our inputs to a standard scale, we're allowing the network to more quickly learn the optimal parameters for each input node.
Additionally, it's useful to ensure that our inputs are roughly in the range of -1 to 1 to avoid weird mathematical artifacts associated with floating-point precision. In short, computers lose accuracy when performing math operations on really large or really small numbers. Moreover, if your inputs and target outputs are on a completely different scale than the typical -1 to 1 range, the default parameters for your neural network (i.e. learning rates) will likely be ill-suited for your data. In the case of images, dividing by 255 bounds the pixel intensities to [0, 1]; alternatively, you can standardize them to mean 0 and variance 1.
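As a minimal sketch (assuming the standard tf.keras Fashion MNIST loader), here are the two common pre-processing options side by side; the standardization variant is an alternative illustration, not part of the original answer:

import tensorflow as tf

# Load Fashion MNIST (pixel values are uint8 in [0, 255])
(training_images, training_labels), (test_images, test_labels) = \
    tf.keras.datasets.fashion_mnist.load_data()

# Option 1: scale pixel values into [0, 1]
training_images = training_images / 255.0
test_images = test_images / 255.0

# Option 2 (alternative): standardize to roughly mean 0 and variance 1,
# using statistics computed on the training set only
# mean, std = training_images.mean(), training_images.std()
# training_images = (training_images - mean) / std
# test_images = (test_images - mean) / std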

Related

What is the reason for very high variations in val accuracy for multiple model runs?

I have a 2-layer neural network that I'm training on about 10000 features (genomic data) with about 100 samples in my data set. I realized that any time I run my model (i.e. compile & fit) I get varying validation/testing accuracies, even if I leave the train/test/validation split untouched. Sometimes it's around 70%, sometimes around 90%.
Due to the stochastic nature of the NN I anticipate some variation but could these strong fluctuations be a sign of something else?
The reason why you're seeing such a big instability with your validation accuracy is because your neural network is huge in comparison to the data you train it on.
Even with just 12 neurons per layer, you still have 12 * 10000 + 12 = 120012 parameters in your first layer. Now think about what the neural network does under the hood. It takes your 10000 inputs, multiplies each input by some weight, and then sums all these inputs. Now you provide it with only 64 training examples, from which the training algorithm is supposed to decide what the correct input weights are. From a purely combinatorial perspective, there is going to be a large number of weight assignments that do well on your 64 training samples, and you have no guarantee that the training algorithm will pick a weight assignment that will also do well on your out-of-sample data.
A neural network is able to represent a wide variety of functions (it has been proven that, under certain assumptions, it can approximate any function; this is the universal approximation theorem). To select the function you want, you provide the training algorithm with data to constrain the space of all possible functions the network can represent to a subspace of functions that fit your data. However, such a function is in no way guaranteed to represent the true underlying relationship between the input and the output. And especially if the number of parameters is larger than the number of samples (in this case by a few orders of magnitude), you're nearly guaranteed to see your network simply memorize the samples in your training data, simply because it has the capacity to do so and you haven't constrained it enough.
In other words, what you're seeing is overfitting. In NNs, the general rule of thumb is that you want at least a couple of times more samples than you have parameters (look into the Hoeffding inequality for the theoretical rationale), and in effect the more samples you have, the less you need to worry about overfitting.
So here are a couple of possible solutions:
Use an algorithm that's more suitable for the case where you have high input dimension and low sample count, such as a kernel SVM (Support Vector Machine). With such a low sample count, it's quite possible that a kernel SVM will achieve better and more consistent validation accuracy. (You can easily test this; kernel SVMs are available in the scikit-learn package and are really easy to use.)
If you insist on using a NN, use regularization. Given that you already have working code, this will be easy: just add kernel_regularizer to all your layers (see the sketch after this list). I would try both L1 and L2 regularization (probably separately). L1 regularization tends to push weights to zero, so it might help reduce the effective number of parameters in your problem; L2 just tries to make all the weights small. Use your validation set to decide the best value for each regularization. You can optimize both for the best mean accuracy and for the lowest variance in accuracy on your validation data (do something like 20 training runs for each parameter value of L1 and L2 regularization; usually just trying different orders of magnitude is sufficient, e.g. 1e-4, 1e-3, 1e-2, 1e-1, 1, 1e1).
If most of your input features are not really predictive, or if they are highly correlated, PCA (Principal Component Analysis) can be used to project your inputs into a much lower-dimensional space (e.g. from 10000 down to 20), where you'd have a much smaller neural network (I'd still use L1 or L2 regularization, because even then you'd have more weights than training samples).
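A minimal sketch of the regularization option: the layer sizes, the binary-classification head, and the 1e-3 regularization strength are illustrative assumptions, not tuned values.

from tensorflow import keras
from tensorflow.keras import regularizers

# Two hidden layers of 12 neurons, each with an L2 penalty on the weights;
# swap regularizers.l2 for regularizers.l1 to try L1 instead.
model = keras.Sequential([
    keras.layers.Dense(12, activation="relu", input_shape=(10000,),
                       kernel_regularizer=regularizers.l2(1e-3)),
    keras.layers.Dense(12, activation="relu",
                       kernel_regularizer=regularizers.l2(1e-3)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])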
On a final note, the point of a test set is to use it very sparingly (ideally only once). It should be the final reported metric after all your research and model tuning is done. You should not optimize any values on it; do all of that on your validation set. To avoid overfitting on your validation set, look into k-fold cross-validation.

Large trainable embedding layer slows down training

I am training a network to classify text with an LSTM. I use a randomly initialized and trainable embedding layer for the word inputs. The network is trained with the Adam optimizer and the words are fed into the network as one-hot encodings.
I noticed that the number of words represented in the embedding layer heavily influences the training time, but I don't understand why. Increasing the number of words in the network from 200'000 to 2'000'000 almost doubled the time per training epoch.
Shouldn't training only update the weights that were used during the prediction of the current data point? Thus, if my input sequence always has the same length, the same number of updates should happen regardless of the size of the embedding layer.
The number of updates needed would be reflected in the number of epochs it takes to reach a certain precision.
If your observation is that convergence takes the same number of epochs, but each epoch takes twice as much wall-clock time, then it's an indication that simply performing the embedding lookup (and writing the updates back to the embedding table) now takes a significant part of your training time.
Which could easily be the case: 2'000'000 words times 4 bytes per float32 times the length of your embedding vector (what is it? let's assume 200) is something like 1.6 gigabytes of data that needs to be touched every minibatch. You're also not saying how you're training this (CPU, GPU, which GPU), which has a meaningful impact on how this turns out because of e.g. cache effects; on a CPU, doing the exact same number of reads/writes in a slightly less cache-friendly manner (more sparsity) can easily double the execution time.
Also, your premise is a bit unusual. How much labeled data do you have with enough examples of the 2,000,000th rarest word to calculate a meaningful embedding directly? It's probably possible, but it would be unusual: in pretty much all datasets, including very large ones, the 2,000,000th word would be a nonce, and it would be harmful to include it in trainable embeddings. The usual scenario is to calculate large embeddings separately from large unlabeled data and use them as a fixed, untrainable layer, possibly concatenated with small trainable embeddings learned from the labeled data to capture things like domain-specific terminology.
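A minimal sketch of that usual setup in Keras; the vocabulary size, embedding dimension, and pretrained_matrix below are placeholders for illustration, not values from the question:

import numpy as np
from tensorflow import keras

# Hypothetical embedding table computed separately (e.g. word2vec/GloVe on unlabeled text)
vocab_size, embedding_dim = 50_000, 200
pretrained_matrix = np.zeros((vocab_size, embedding_dim), dtype="float32")  # placeholder values

# Frozen embedding layer: the large table is looked up but never updated during training
embedding = keras.layers.Embedding(
    input_dim=vocab_size,
    output_dim=embedding_dim,
    embeddings_initializer=keras.initializers.Constant(pretrained_matrix),
    trainable=False,
)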
If I understand correctly, your network takes one-hot vectors representing words to embeddings of some size embedding_size. Then the embeddings are fed as input to an LSTM. The trainable variables of the network are both those of the embedding layer and the LSTM itself.
You are correct regarding the update of the weights in the embedding layer. However, the number of weights in one LSTM cell depends on the size of the embedding. If you look, for example, at the equation for the forget gate of the t-th cell,
f_t = sigmoid(W_f · [h_{t-1}, x_t] + b_f),
you can see that the matrix of weights W_f is multiplied by the input x_t, meaning that one of the dimensions of W_f must be exactly embedding_size. So as embedding_size grows, so does the network size, and so it takes longer to train.
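A small sketch to confirm this effect; the hidden size of 128 units and the two embedding sizes are arbitrary illustrative values:

from tensorflow import keras

# Count the trainable parameters of a single LSTM layer for a given embedding size
def lstm_params(embedding_size, units=128):
    inputs = keras.Input(shape=(None, embedding_size))
    outputs = keras.layers.LSTM(units)(inputs)
    return keras.Model(inputs, outputs).count_params()

print(lstm_params(100))  # 4 * (128 * (100 + 128) + 128) = 117248
print(lstm_params(300))  # 4 * (128 * (300 + 128) + 128) = 219648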

Predict all probable trajectories in a grid structure using Keras

I'm trying to predict sequences of 2D coordinates. But I don't want only the most probable future path but all the most probable paths to visualize it in a grid map.
For this I have training data consisting of 40000 sequences. Each sequence consists of 10 2D coordinate pairs as input and 6 2D coordinate pairs as labels.
All the coordinates are in a fixed value range.
What would be my first step to predict all the probable paths? To get all probable paths I would have to apply a softmax at the end, where each cell in the grid is one class, right? But how should I process the data to reflect this grid-like structure? Any ideas?
A softmax activation won't do the trick I'm afraid; if you have an infinite number of combinations, or even a finite number of combinations that do not already appear in your data, there is no way to turn this into a multi-class classification problem (or if you do, you'll have loss of generality).
The only way forward I can think of is a recurrent model employing variational encoding. To begin with, you have a lot of annotated data, which is good news; a recurrent network fed with a sequence X (10,2,) will definitely be able to predict a sequence Y (6,2,). But since you want not just one but rather all probable sequences, this won't suffice. Your implicit assumption here is that there is some probability space hidden behind your sequences, which affects how they play out over time; so to model the sequences properly, you need to model that latent probability space. A Variational Auto-Encoder (VAE) does just that; it learns the latent space, so that during inference the output prediction depends on sampling over that latent space. Multiple predictions over the same input can then result in different outputs, meaning that you can finally sample your predictions to empirically approximate the distribution of potential outputs.
Unfortunately, VAEs can't really be explained within a single paragraph on Stack Overflow, and even if they could, I wouldn't be the most qualified person to attempt it. Try searching the web for LSTM-VAE and arm yourself with patience; you'll probably need to do some studying, but it's definitely worth it. It might also be a good idea to look into Pyro or Edward, which are probabilistic programming libraries for Python, better suited to the task at hand than Keras.
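To make the idea concrete, here is a minimal sketch of the sampling step (the reparameterization trick) inside an LSTM encoder-decoder in Keras; the latent dimension, layer sizes, and overall architecture are illustrative assumptions, and a proper VAE would also need the KL-divergence term added to the loss:

import tensorflow as tf
from tensorflow import keras

latent_dim = 8

inputs = keras.Input(shape=(10, 2))                    # past trajectory: 10 (x, y) points
h = keras.layers.LSTM(64)(inputs)                      # encode the observed sequence
z_mean = keras.layers.Dense(latent_dim)(h)
z_log_var = keras.layers.Dense(latent_dim)(h)

def sample(args):
    mean, log_var = args
    eps = tf.random.normal(tf.shape(mean))
    return mean + tf.exp(0.5 * log_var) * eps          # reparameterization trick

z = keras.layers.Lambda(sample)([z_mean, z_log_var])   # sample from the latent space
d = keras.layers.RepeatVector(6)(z)                    # decode into 6 future (x, y) points
d = keras.layers.LSTM(64, return_sequences=True)(d)
outputs = keras.layers.TimeDistributed(keras.layers.Dense(2))(d)

model = keras.Model(inputs, outputs)
# Predicting several times on the same input yields different samples of the future path.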

Approximating multidimensional functions with neural networks

Is it possible to fit or approximate multidimensional functions with neural networks?
Let's say I want to model the function f(x,y)=sin(x)+y from some given measurement data. (f(x,y) is considered as ground truth and is not known). Also if it's possible some code examples written in Tensorflow or Keras would be great.
As said by @AndreHolzner, theoretically you can approximate any continuous function with a neural network as well as you want, on any compact subset of R^n, even with only one hidden layer.
However, in practice, the neural net may have to be very large for some functions, and can sometimes be untrainable (the optimal weights may be hard to find without getting stuck in a local minimum). So here are a few practical suggestions (unfortunately vague, because the details depend too much on your data and are hard to predict without multiple tries):
Keep the network not too big (it's hard to define precisely, unfortunately), or you'll just overfit. You'll probably need a LOT of training samples.
A big number of reasonably-sized layers is usually better than a reasonable number of big layers.
If you have some priors about the function, use them: for instance, if you believe there is some kind of periodicity in f (as in your example, but it could be more complicated), you could add the sin() function to some of the outputs of the first layer (not all, as that would give you a truly periodic output). If you suspect a polynomial of degree n, just augment your input x with x², ..., x^n and use a linear regression on that input, etc. It will be much easier than learning the weights.
The universal approximation theorem holds on any compact subset of R^n, not on the entire multidimensional space. In particular, you'll never be able to predict the value for an input that's way bigger than any of the training samples (say you trained on numbers from 0 to 100; don't test on 200, it will fail).
For an example of regression you can look here, for instance. To regress a more complicated function you'd need a more complicated network to compute pred from x, for instance like this:
import tensorflow as tf

n_layers = 3
n_dimensions = 2  # e.g. the (x, y) inputs of f(x, y)

# Placeholder for a batch of inputs; the batch dimension is left as None (variable size)
x = tf.placeholder(shape=[None, n_dimensions], dtype=tf.float32)

last_layer = x
# Add n_layers dense hidden layers
for i in range(n_layers):
    last_layer = tf.layers.dense(inputs=last_layer, units=128, activation=tf.nn.relu)
# Get the output prediction
pred = tf.layers.dense(inputs=last_layer, units=1, activation=None)
# Get the cost, training op, etc., just like in the linear regression example
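Since the question also asks for a Keras example, here is a self-contained sketch for the f(x, y) = sin(x) + y case; the sampling domain, layer sizes, and epoch count are arbitrary choices, not tuned values:

import numpy as np
from tensorflow import keras

# Generate "measurements" of the unknown function on a compact domain
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(10000, 2))    # inputs (x, y)
Y = np.sin(X[:, 0]) + X[:, 1]                  # ground truth f(x, y) = sin(x) + y

# Small fully connected regression network
model = keras.Sequential([
    keras.layers.Dense(128, activation="relu", input_shape=(2,)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, Y, epochs=20, batch_size=64, validation_split=0.1)

# Should be close to sin(1.0) + 0.5 ≈ 1.34 (an input inside the training domain)
print(model.predict(np.array([[1.0, 0.5]])))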

TensorFlow - Batch normalization failing on regression?

I'm using TensorFlow for a multi-target regression problem. Specifically, a convolutional network with pixel-wise labeling, where the input is an image and the label is a "heat-map" in which each pixel has a float value. More specifically, the ground-truth label for each pixel is lower bounded by zero and, while technically having no upper bound, usually gets no larger than 1e-2.
Without batch normalization, the network is able to give a reasonable heat-map prediction. With batch normalization, the network takes much longer to get to a reasonable loss value, and the best it does is make every pixel the average value. This is using the tf.contrib.layers conv2d and batch_norm methods, with batch_norm being passed to conv2d's normalization_fn (or not, in the case of no batch normalization). I briefly tried batch normalization on another (single-value) regression network and had trouble then as well (though I hadn't tested that as extensively). Is there a problem with using batch normalization on regression problems in general? Is there a common solution?
If not, what could be some causes of batch normalization failing in such an application? I've attempted a variety of initializations, learning rates, etc. I would expect the final layer (which of course does not use batch normalization) could use its weights to scale the output of the penultimate layer to the appropriate regression values. Failing that, I also removed batch norm from that layer, but with no improvement. I've attempted a small classification problem using batch normalization and saw no problem there, so it seems reasonable that it could somehow be due to the nature of the regression problem, but I don't know how that could cause such a drastic difference. Is batch normalization known to have trouble on regression problems?
I believe your issue is in the labels. Batch norm keeps the activations flowing through the network on a standardized scale. If the labels are not scaled to a comparable range, the task becomes more difficult, because the NN has to produce values on a very different scale from its normalized activations.
By removing the batch norm from the penultimate layer, the task may improve slightly, but you are still requiring an NN layer to learn to map activations on the normalized scale down to the tiny label scale (working against your objective).
To solve this problem, apply a 0-1 scaler to the labels so that your upper bound is no longer 1e-2. During inference, transform the predictions back with the inverse of the same scaler to get the actual predictions.
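A minimal sketch of that label scaling, assuming scikit-learn's MinMaxScaler and a hypothetical flattened heat-map target array (the shapes and the model variable are placeholders):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical per-pixel targets, flattened to (n_samples, n_pixels) for the scaler
y_train = np.random.uniform(0.0, 1e-2, size=(100, 64 * 64))

scaler = MinMaxScaler(feature_range=(0, 1))
y_train_scaled = scaler.fit_transform(y_train)   # train the network on these targets

# ... train the model on (x_train, y_train_scaled) ...

# At inference time, map the predictions back to the original label scale
# preds_scaled = model.predict(x_test)
# preds = scaler.inverse_transform(preds_scaled)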