How am I getting 92% accuracy after initialising parameters with zeros in a simple one layer neural network? - tensorflow

This is from one of the tensorflow examples mnist_softmax.py.
Even though the gradients are non-zero, I expected them to be identical, so that all ten weight vectors corresponding to the ten classes would stay exactly the same and produce the same output logits, and hence the same probabilities. The only way I could see this working is that tf.argmax(), whose output is ambiguous in case of ties, happens to get lucky when the accuracy is calculated, resulting in 92% accuracy. But then I checked the values of y after training is complete, and they are clearly different outputs, indicating that the weight vectors of the classes are not the same. Can someone explain how this is possible?

Although it is best to initialize the parameters to small random numbers to break symmetry and possibly accelerate learning, initializing the weights to zeros does not necessarily mean you will get the same probabilities for all classes.
The reason is that the cross-entropy function is a function of the weights, the inputs, and the correct class labels. So the gradient is different for each output 'neuron', depending on the correct class label, and this breaks the symmetry.
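To see this concretely, here is a minimal sketch (TF2 style, using an illustrative random batch as a stand-in for MNIST) showing that even with zero-initialized weights, each of the ten output neurons receives a different gradient:

import tensorflow as tf

# Zero-initialized single-layer softmax classifier (784 inputs, 10 classes).
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))

# Illustrative stand-in for an MNIST batch.
x = tf.random.uniform([32, 784])
labels = tf.random.uniform([32], maxval=10, dtype=tf.int32)

with tf.GradientTape() as tape:
    logits = tf.matmul(x, W) + b   # all logits are 0, so the softmax is uniform
    loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=labels, logits=logits))
grad_W, grad_b = tape.gradient(loss, [W, b])

# The gradient w.r.t. the logits is softmax(logits) - one_hot(label):
# 0.1 for every class except -0.9 at the true class. Each of the ten
# weight columns therefore gets a different update, breaking the symmetry.
print(grad_b.numpy())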

Related

Binary classification of pairs with opposite labels

I have a data-set without labels, but I do have a way to get pairs of examples with opposite labels, that is given a pair x,z I know that their true labels are either 0,1 or 1,0.
So, I am building a model that accepts pairs of samples as input and learns to classify them with opposite labels. Assuming I have an arbitrary model for predicting a single sample, y_hat = f(x), I am building a model with Keras that accepts pairs of samples (x, z) and outputs pairs of predictions, f(x), f(z). I then use a custom loss function that drives the model in the correct direction: given that a regular binary classifier is trained using Binary Cross-Entropy (BCE) to make the predicted and desired outputs "close", I use the negative BCE. Also, since BCE is not symmetric, I symmetrize it. So, the loss function I pass to the model.compile method is:
from tensorflow import keras
bce = keras.losses.BinaryCrossentropy()
def neg_sym_bce(y1, y2):
    # Negative, symmetrized BCE: rewards pushing the two predictions apart.
    return -0.5 * (bce(y1, y2) + bce(y2, y1))
My problem is, this model fails to learn to classify even a single pair of my data (I get f(x)~=f(z)~=0.5), and if I try to train it with synthetic "easy" data, it takes hundreds of epochs to converge (also on a single pair).
This made me suspect that it has to do with a "vanishing gradient" problem. Indeed, when I plot (see below) the loss for a single pair, which is a function of 2 variables (the 2 outputs), it is evident that there is a wide plateau around the point 0.5, 0.5. It is also evident that the global minimum is, as expected, around the points 0,1 and 1,0.
So, is there a way to deal with the vanishing gradient here? I read about the problem but the references I found deal with vanishing gradient in the network, not in the loss itself.
Or, is there another loss that can drive the model to predict opposite labels?
I think if your labels are always either 0,1 or 1,0, you can just use categorical_crossentropy for the loss.
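A minimal sketch of that idea (the shared scorer and the feature size here are assumptions, not from the question): normalize the two outputs with a softmax so the pair forms a 2-way distribution, and train against the one-hot pair labels (0,1) or (1,0) with categorical cross-entropy:

from tensorflow import keras

x_in = keras.Input(shape=(16,))   # hypothetical per-sample feature size
z_in = keras.Input(shape=(16,))
score = keras.layers.Dense(1)     # shared scorer f(.) producing one logit per sample
pair = keras.layers.Concatenate()([score(x_in), score(z_in)])
probs = keras.layers.Softmax()(pair)   # (f(x), f(z)) as a 2-way distribution
model = keras.Model([x_in, z_in], probs)
model.compile(optimizer="adam", loss="categorical_crossentropy")
# model.fit([x_batch, z_batch], one_hot_pair_labels, ...)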

What are weights responsible for?

I'm reading the google ML crash course and have one question.
What is a weight? (I understand that this is a slope in a plot, but it doesn't fit into my understanding)
I also don't understand the impact of a weight on the model's prediction (for example, in this playground)
Many thanks for the help.
Every layer in a model is a huge mathematical function with many "unknown" variables.
When you build a model, you build a monster function (with thousands or millions of unknown variables) that gives an output from an input.
Something like this:
output_tensor = huge_function(your_input_tensor,var1,var2,var3,var4.......,var10000000)
These variables are the weights. At the beginning, they receive random values, and obviously your function gives you terrible results.
As you train, you adjust the values of these variables so that your results improve.
Weights are such variables, the ones in the model that you are going to adjust so that your huge function brings you good results.
Weights vs. biases
Depending on what you are reading, or what program you're using, both of these may simply be called weights; according to what I wrote above, both fit the description.
But usually:
Weights - Multiply the inputs
Biases - Are added to the multiplied outputs
So, the usual layers (with some important differences, of course), perform operations like:
output_matrix = input_matrix x weights + biases
Nothing prevents you from creating custom operations, though, where your variables/weights neither multiply nor add.
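As a toy numerical sketch of the usual layer operation above (the shapes are chosen arbitrarily for illustration):

import numpy as np

rng = np.random.default_rng(0)
inputs = rng.normal(size=(4, 3))     # batch of 4 samples, 3 features each
weights = rng.normal(size=(3, 2))    # 3 inputs mapped to 2 outputs
biases = np.zeros(2)

# output_matrix = input_matrix x weights + biases
outputs = inputs @ weights + biases
print(outputs.shape)                 # (4, 2)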

Keras model returns different values

To play with data, I have trained a linear regression with Keras+TensorFlow, and compared the first prediction computed in 3 different ways:
1. I got the weights from the model and just used the linear regression formula p = w*X0 + b
2. I got predictions using the model.predict(X) method of Keras for the whole data array X, and then took only the first element of it
3. I got a prediction using the same method, but only for the first row of features X0 (the first sample)
In theory, all of those methods should produce the very same value. In practice, however, the values I get are slightly different.
The difference is not that big, but I still wonder why that is the case. Is it only due to float precision in Python?
This is most likely due to the fact that matrix multiplications and convolutions are implemented in a way that is non-deterministic: if you change the batch size, you change the order in which the multiply-adds happen, and since floating-point addition is not associative, you get slightly different results.
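A quick standalone sketch of the non-associativity (not from the original answer):

import numpy as np

a = np.float32(1e8)
b = np.float32(-1e8)
c = np.float32(0.1)
print((a + b) + c)   # 0.1 -- the large terms cancel first
print(a + (b + c))   # 0.0 -- 0.1 is lost against 1e8 in float32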

Convolutional Neural Network Training

I have a question regarding convolutional neural network (CNN) training.
I have managed to train a network using tensorflow that takes an input image (1600 pixels) and outputs one of three classes that matches it.
Testing the network with variations of the trained classes gives good results. However, when I give it a different, fourth image (one that does not contain any of the three trained classes), it always returns a random match to one of the classes.
My question is, how can I train a network to recognize that an image does not belong to any of the three trained classes? A similar example: if I trained a network on the MNIST database and then gave it the character "A" or "B", is there a way to detect that the input does not belong to any of the classes?
Thank you
Your model will always predict one of the labels it was trained on; for example, if you train your model on MNIST data, the prediction will always be a digit 0-9, just like the MNIST labels.
What you can do is first train a different model with 2 classes, in which you predict whether an image belongs to your data set or not. E.g., for MNIST data, label all MNIST images as 1, add images from other sources that are different (not digits 0-9) and label them as 0, and then train a model to decide whether an image belongs to MNIST or not.
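A minimal sketch of that two-stage idea (the architecture and layer sizes here are assumptions for illustration):

from tensorflow import keras

# Binary "in-distribution" gate: MNIST-like = 1, anything else = 0.
gate = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),   # P(image is MNIST-like)
])
gate.compile(optimizer="adam", loss="binary_crossentropy")
# Train `gate` on MNIST images labeled 1 plus non-digit images labeled 0;
# at inference, only pass images with gate(x) > 0.5 to the digit classifier.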
A convolutional neural network (CNN) predicts a result from the classes defined during training; it always returns one of those classes, however confident it is. I have faced a similar problem. What you can do is check the predicted probability (confidence) value: if it is below some threshold, treat the input as belonging to a "none" category. Hope this helps.
You probably have three output nodes, and choose the maximum value (one-hot encoding). That's a bit unfortunate as it's a low number of outputs. Non-recognized inputs tend to cause pretty random outputs.
Now, with 3 outputs, roughly speaking you can get 7 outcomes. You might get a single high value (3 possibilities), but a non-recognized input can also cause 2 high outputs (also 3 possibilities) or approximately equal outputs on all nodes (1 possibility). So there's a decent chance (~3/7) of a random input producing a pattern on the output nodes which you'd only expect for a recognized input.
Now, if you had 15 classes and thus 15 output nodes, you'd be looking at roughly 32767 possible outcomes for unrecognized inputs, only 15 of which correspond to expected one-hot outcomes.
Underlying this is a lack of training data. If your training set has examples outside the 3 classes, you can just dump those into a 4th "other" category and train with that. This by itself isn't a fully reliable indicator, as the theoretical "other" set is usually huge, but you now have 2 complementary ways of detecting other inputs: either the "other" output node fires, or you see one of the 11 ambiguous patterns (with 4 output nodes there are 15 rough patterns, only 4 of which are one-hot).
Another solution would be to check what output your CNN usually gives when shown something else. If the last layer is a softmax, your CNN returns probabilities for the three given classes; if none of these probabilities is close to 1, this might be a sign that the input is something else, assuming your CNN is well trained (it must be penalized for overconfidence when predicting wrong labels).
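A small sketch of that check (the threshold of 0.9 is an arbitrary assumption):

import numpy as np

def classify_with_reject(probs, threshold=0.9):
    # probs: softmax output over the 3 trained classes.
    top = int(np.argmax(probs))
    return top if probs[top] >= threshold else "other"

print(classify_with_reject(np.array([0.97, 0.02, 0.01])))  # -> 0
print(classify_with_reject(np.array([0.40, 0.35, 0.25])))  # -> 'other'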

Dropout with relu activations

I am trying to implement a neural network with dropout in tensorflow.
tf.layers.dropout(inputs, rate, training)
From the documentation: "Dropout consists in randomly setting a fraction rate of input units to 0 at each update during training time, which helps prevent overfitting. The units that are kept are scaled by 1 / (1 - rate), so that their sum is unchanged at training time and inference time."
Now I understand this behavior if dropout is applied on top of sigmoid activations that are strictly above zero. If half of the input units are zeroed, the sum of all the outputs will also be roughly halved, so it makes sense to scale them by a factor of 2 in order to regain some kind of consistency before the next layer.
Now what if one uses the tanh activation which is centered around zero? The reasoning above no longer holds true so is it still valid to scale the output of dropout by the mentioned factor? Is there a way to prevent tensorflow dropout from scaling the outputs?
Thanks in advance
If you have a set of inputs to a node and a set of weights, their weighted sum is a value, S. If you select a random fraction f of those inputs, the expected weighted sum (using the same weights) over the selected subset is f * S. From this, you can see that the argument for rescaling is exact if the objective is that the mean of the sum remains the same with and without dropout. It holds exactly when the activation function is linear over the range of the weighted sums of subsets, and approximately when the activation function is approximately linear over that range.
After passing the linear combination through a non-linear activation function, it is no longer true that rescaling exactly preserves the expected mean. However, if the contribution to a node is not dominated by a small number of inputs, the variance of the sum of a randomly selected, fairly large subset will be relatively small, and if the activation function is approximately linear near the output value, rescaling will work well to produce an output with approximately the same mean. E.g. the logistic and tanh functions are approximately linear over any small region. Note that the range of the function is irrelevant, only the differences between its values.
With relu activation, if the original weighted sum is close enough to zero for the weighted sums of subsets to land on both sides of zero (a non-differentiable point of the activation function), rescaling won't work so well; but this is a relatively rare situation, limited to outputs that are small in magnitude, so it may not be a big problem.
The main observations here are that rescaling works best with large numbers of nodes making significant contributions, and relies on local approximate linearity of activation functions.
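A quick numerical sketch of the rescaling argument (illustrative values, not from the question): dropping a fraction `rate` of units and scaling the survivors by 1 / (1 - rate) keeps the sum unchanged in expectation.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(size=100_000)            # stand-in for weighted inputs to a node
rate = 0.5

mask = rng.random(x.shape) >= rate       # keep each unit with prob 1 - rate
dropped = np.where(mask, x, 0.0) / (1.0 - rate)

# The rescaled, dropped-out sum matches the full sum up to sampling noise.
print(x.sum(), dropped.sum())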
The point of setting a node's output to zero is that the neuron then has no effect on the neurons it feeds. This creates sparsity and hence attempts to reduce overfitting. When using sigmoid or tanh, the dropped value is still set to zero.
I think your line of reasoning here is incorrect: think of each unit's contribution rather than of the sum.