What are weights responsible for? - tensorflow

I'm reading the google ML crash course and have one question.
What is a weight? (I understand that it is the slope in a plot, but that doesn't fit into my understanding.)
I also don't understand the impact of weights on the model's prediction (for example, in this playground).
Many thanks for the help.

Every layer in a model is a huge mathematical function with many "unknown" variables.
When you build a model, you build a monster function (with thousands or millions of unknown variables) that gives an output from an input.
Something like this:
output_tensor = huge_function(your_input_tensor,var1,var2,var3,var4.......,var10000000)
These variables are the weights. At the beginning, they receive random values, and obviously your function gives you terrible results.
As you train, you adjust the values of these variables so that your results improve.
The weights are exactly these variables: the ones in the model that you adjust so that your huge function gives you good results.
Weights vs. Biases
Depending on what you are reading, or which framework you're using, both of them may simply be called "weights". By the description above, both fit: they are the adjustable variables.
But usually:
Weights - Multiply the inputs
Biases - Are added to the multiplied outputs
So, the usual layers (with some important differences, of course), perform operations like:
output_matrix = input_matrix x weights + biases
Nothing prevents you from creating custom operations, though, where your variables/weights neither multiply nor add.
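To make the weights/biases split concrete, here is a minimal sketch using a tf.keras Dense layer (the layer size and input values are arbitrary choices for illustration):

import numpy as np
import tensorflow as tf

# A Dense layer computes: output = input @ kernel + bias (plus an optional activation)
layer = tf.keras.layers.Dense(units=3, activation=None)
x = np.ones((1, 4), dtype=np.float32)   # one example with 4 input features
y = layer(x)                            # calling the layer creates its variables

kernel, bias = layer.get_weights()      # the trainable "weights" of this layer
print(kernel.shape)                     # (4, 3) -> multiplies the inputs
print(bias.shape)                       # (3,)   -> added to the multiplied outputs
print(np.allclose(y.numpy(), x @ kernel + bias))  # True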

Is there a way to retrieve the weights from a GPflow GPR model?
I do not necessarily need the explicit weights. However, I have two issues that may be solved using the weights:
1. I would like to compile and send a trained model to a third party. I would like to do this without sending the training data and without the third party having access to the training data.
2. I would like to be able to predict new mean values without calculating new variances. Currently predict_f calculates both the mean and the variance, but I only use the mean. I believe I could speed up my prediction significantly if I didn't calculate the variance.
I could resolve both of these issues if I could retrieve the weights from the GPR model after training. However, if it is possible to resolve these tasks without ever dealing with explicit weights, that would be even better.
It's not entirely clear what you mean by "explicit weights", but if you mean alpha = Kxx^{-1} y, where Kxx is the kernel matrix k(x, x') evaluated at the training inputs and y is the vector of observation targets, then you can get that from the Posterior object (see https://github.com/GPflow/GPflow/blob/develop/gpflow/posteriors.py), which you obtain by calling posterior = model.posterior(). You can then access posterior.alpha.
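As a minimal sketch of the above (GPflow 2.x; the toy model construction is just for illustration, and the exact attribute names may differ between GPflow versions, so check posteriors.py in your installed version):

import numpy as np
import gpflow

# Toy training data, just to have a fitted GPR model to inspect
X = np.random.rand(50, 1)
y = np.sin(6 * X) + 0.1 * np.random.randn(50, 1)
model = gpflow.models.GPR(data=(X, y), kernel=gpflow.kernels.SquaredExponential())

# Precompute the posterior once; this caches alpha = Kxx^{-1} y internally
posterior = model.posterior()
print(posterior.alpha.shape)

# Repeated predictions through the posterior object reuse the cached quantities
Xnew = np.linspace(0, 1, 10).reshape(-1, 1)
mean, var = posterior.predict_f(Xnew, full_cov=False)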
Re 1.: However, for predictions you still need to be able to compute Kzx the covariance between new test points and the training points, so you will also need to provide the training locations and kernel hyperparameters.
This also means that you cannot rely on this to keep your training data secret: the third party could simply compute Kxx instead of Kzx and then recover y = Kxx @ alpha. You can avoid sharing exact (x, y) training-set pairs by using a sparse approximation (this would remove "individual identifiability" at least), but I still wouldn't rely on it for privacy.
Re 2.: The Posterior object already provides much faster predictions; if you only ask for marginal variances (full_cov=False, the default), then you're at worst about a factor of 3 slower than predicting just the mean (in practice, I would guesstimate less than 1.5x as slow). As of GPflow 2.3.0, there is no implementation within GPflow of predicting the mean only.

Predict all probable trajectories in a grid structure using Keras

I'm trying to predict sequences of 2D coordinates. But I don't want only the most probable future path; I want all the probable paths, so that I can visualize them in a grid map.
For this I have training data consisting of 40000 sequences. Each sequence consists of 10 2D coordinate pairs as input and 6 2D coordinate pairs as labels.
All the coordinates are in a fixed value range.
What would be my first step to predict all the probable paths? To get all probable paths I would have to apply a softmax at the end, where each cell in the grid is one class, right? But how do I process the data to reflect this grid-like structure? Any ideas?
A softmax activation won't do the trick, I'm afraid: if you have an infinite number of combinations, or even a finite number of combinations that do not already appear in your data, there is no way to turn this into a multi-class classification problem (or if you do, you lose generality).
The only way forward I can think of is a recurrent model employing variational encoding. To begin with, you have a lot of annotated data, which is good news; a recurrent network fed with a sequence X (10,2,) will definitely be able to predict a sequence Y (6,2,). But since you want not just one but rather all probable sequences, this won't suffice. Your implicit assumption here is that there is some probability space hidden behind your sequences, which affects how they play out over time; so to model the sequences properly, you need to model that latent probability space. A Variational Auto-Encoder (VAE) does just that; it learns the latent space, so that during inference the output prediction depends on sampling over that latent space. Multiple predictions over the same input can then result in different outputs, meaning that you can finally sample your predictions to empirically approximate the distribution of potential outputs.
Unfortunately, VAEs can't really be explained in a single paragraph on Stack Overflow, and even if they could, I wouldn't be the most qualified person to attempt it. Try searching the web for "LSTM-VAE" and arm yourself with patience; you'll probably need to do some studying, but it's definitely worth it. It might also be a good idea to look into Pyro or Edward, which are probabilistic programming libraries for Python, better suited to the task at hand than Keras.
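To make the suggestion slightly more concrete, here is a minimal, hedged sketch of a conditional LSTM-VAE in tf.keras (layer sizes, latent dimension and loss weighting are arbitrary illustration choices, not a tuned architecture):

import tensorflow as tf
from tensorflow.keras import layers, Model

latent_dim = 16  # size of the latent space; an arbitrary choice for this sketch

class Sampling(layers.Layer):
    # Draws z ~ N(mean, exp(log_var)) and adds the KL penalty to the model's loss
    def call(self, inputs):
        mean, log_var = inputs
        kl = -0.5 * tf.reduce_mean(1.0 + log_var - tf.square(mean) - tf.exp(log_var))
        self.add_loss(kl)
        eps = tf.random.normal(tf.shape(mean))
        return mean + tf.exp(0.5 * log_var) * eps

# Encoder: summarise the observed sequence (10 coordinate pairs) into a latent distribution
enc_in = layers.Input(shape=(10, 2))
h = layers.LSTM(64)(enc_in)
z_mean = layers.Dense(latent_dim)(h)
z_log_var = layers.Dense(latent_dim)(h)
z = Sampling()([z_mean, z_log_var])

# Decoder: unroll the latent sample into the predicted sequence (6 coordinate pairs)
d = layers.RepeatVector(6)(z)
d = layers.LSTM(64, return_sequences=True)(d)
out = layers.TimeDistributed(layers.Dense(2))(d)

model = Model(enc_in, out)
model.compile(optimizer="adam", loss="mse")  # total loss = reconstruction MSE + KL term

# model.fit(X_train, Y_train, ...)           # X_train: (N, 10, 2), Y_train: (N, 6, 2)
# At inference, calling model.predict(x) repeatedly samples different z values, so the
# same input yields different plausible future paths that you can bin into your grid.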

Added data reducing accuracy

I'm running into a situation where giving a neural network extra data reduces accuracy, and I can't see how that's possible.
Suppose you train a neural network - just a binary classifier - on a set of examples that have, say, 10 variables each. And it learns to classify both training and tests sets quite accurately. Then rerun with the same examples but extra variables on each example, say an extra 20 variables each. Maybe the extra variables don't give as good a signal as the original ones, but it's still getting the original variables too. Worst-case scenario, it should just take a bit more time learning to ignore the extra variables, right? On the face of it, there shouldn't be any way for the accuracy to be less?
To go through everything I can think of:
It's the same set of records in each case.
All the original variables are still there, just with some extra ones added.
It's not about overfitting; the network trained with the extra data is much less accurate on both the training and test sets.
I don't think it's about needing more time. It's been running for a long time now and showing no signs of making progress.
I've tried with the learning rate both unchanged and reduced, same result each way.
Using TensorFlow, simple feedforward network with one hidden layer, Adam optimizer. Code is at https://github.com/russellw/tf-examples/blob/master/multilayer.py and the most important section is
# Inputs and outputs
X = tf.placeholder(dtype, shape=(None, cols))
Y = tf.placeholder(dtype, shape=(None,))
# Hidden layers
n1 = 3
w1 = tf.Variable(rnd((cols, n1)), dtype=dtype)
b1 = tf.Variable(rnd(n1), dtype=dtype)
a1 = tf.nn.sigmoid(tf.matmul(X, w1) + b1)
pr('layer 1: {}', n1)
# Output layer
no = 1
wo = tf.Variable(rnd((n1, no)), dtype=dtype)
bo = tf.Variable(rnd(no), dtype=dtype)
p = tf.nn.sigmoid(tf.squeeze(tf.matmul(a1, wo)) + bo)
# Model
cost = tf.reduce_sum((p-Y)**2/rows)
optimizer = tf.train.AdamOptimizer(args.learning_rate).minimize(cost)
tf.global_variables_initializer().run()
How is it possible for the extra data to make the network less accurate? What am I missing?
You are confusing variables with (training) data. Variables, say weights and biases, are what the network adjusts in order to learn from the data. So by adding input variables you are adding trainable units, which obviously requires more training time. Beyond a certain threshold, your data may become insufficient for the network to learn/update these variables.
Extra data simply means more examples.
So in your case it seems that you have crossed that threshold (assuming you've let the network train long enough before concluding that it isn't learning anymore).
You may want to look into a phenomenon called the curse of dimensionality. From Wikipedia:
The common theme of these problems is that when the dimensionality increases, the volume of the space increases so fast that the available data become sparse.
https://en.wikipedia.org/wiki/Curse_of_dimensionality
It turns out to have at least something to do with the properties of the optimizer. Though Adam worked best of all the optimizers in one case, in a slightly different case it fails badly where Ftrl solves the problem. I don't know why Adam has this failure mode, but the current solution: make the optimizer a parameter and use a batch file to loop through all of them.
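For reference, a sketch of that "optimizer as a parameter" idea against the TF 1.x code above (the args.optimizer flag is a hypothetical addition to the script's argument parser):

# Pick the optimizer by name instead of hard-coding Adam
optimizers = {
    'adam': tf.train.AdamOptimizer,
    'ftrl': tf.train.FtrlOptimizer,
    'sgd': tf.train.GradientDescentOptimizer,
    'rmsprop': tf.train.RMSPropOptimizer,
}
optimizer = optimizers[args.optimizer](args.learning_rate).minimize(cost)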

Approximating multidimensional functions with neural networks

Is it possible to fit or approximate multidimensional functions with neural networks?
Let's say I want to model the function f(x,y) = sin(x) + y from some given measurement data (f(x,y) is considered the ground truth and is not known). Also, if possible, some code examples written in TensorFlow or Keras would be great.
As @AndreHolzner said, theoretically you can approximate any continuous function with a neural network as well as you want, on any compact subset of R^n, even with only one hidden layer.
However, in practice, the neural net may have to be very large for some functions, and can sometimes be untrainable (the optimal weights may be hard to find without getting stuck in a local minimum). So here are a few practical suggestions (unfortunately vague, because the details depend too much on your data and are hard to predict without multiple tries):
Keep the network not too big ("too big" is hard to define, unfortunately), otherwise you'll just overfit. You'll probably need a LOT of training samples.
A big number of reasonably-sized layers is usually better than a reasonable number of big layers.
If you have some priors about the function, use them: for instance, if you believe there is some kind of periodicity in f (like in your example, but it could be more complicated), you could apply the sin() function to some of the outputs of the first layer (not all of them, which would give you a strictly periodic output). If you suspect a polynomial of degree n, just augment your input x with x², ..., x^n and use a linear regression on that input, etc. It will be much easier than learning the weights.
The universal approximation theorem holds on any compact subset of R^n, not on the entire multidimensional space. In particular, you'll never be able to predict the value for an input that's far larger than any of the training samples (say you trained on numbers from 0 to 100; don't test on 200, it will fail).
For an example of regression you can look here, for instance. To regress a more complicated function you'd need a more complicated mapping from x to pred, for instance like this:
n_layers = 3
n_dimensions = 2  # e.g. 2 inputs for f(x, y)
x = tf.placeholder(shape=[None, n_dimensions], dtype=tf.float32)
last_layer = x
# Add n_layers dense hidden layers
for i in range(n_layers):
    last_layer = tf.layers.dense(inputs=last_layer, units=128, activation=tf.nn.relu)
# Get the output prediction
pred = tf.layers.dense(inputs=last_layer, units=1, activation=None)
# Get the cost, training op, etc., just like in the linear regression example
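For completeness, here is a hedged end-to-end sketch in tf.keras that fits f(x, y) = sin(x) + y from synthetic samples (data ranges, layer sizes and training settings are illustrative choices only):

import numpy as np
import tensorflow as tf

# Synthetic "measurements" of the unknown ground truth f(x, y) = sin(x) + y
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(10000, 2)).astype(np.float32)
Y = np.sin(X[:, 0]) + X[:, 1]

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(2,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, Y, epochs=20, batch_size=64, verbose=0)

# The fit should be good inside the training range, but don't expect it to
# extrapolate far outside [-3, 3] (the compactness caveat above).
print(model.predict(np.array([[1.0, 2.0]], dtype=np.float32)))  # close to sin(1) + 2 ≈ 2.84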

How am I getting 92% accuracy after initialising parameters with zeros in a simple one layer neural network?

This is from one of the tensorflow examples mnist_softmax.py.
Even though the gradients are non-zero, they must be identical, and all ten weight vectors corresponding to the ten classes should be exactly the same and produce the same output logits and hence the same probabilities. The only way I could see this being possible is that, while calculating the accuracy using tf.argmax() (whose output is ambiguous in case of ties), we are getting lucky and ending up with 92% accuracy. But then I checked the values of y after training was complete, and they give perfectly different outputs, indicating that the weight vectors of the classes are not the same. Can someone explain how this is possible?
Although it is best to initialize the parameters to small random numbers to break symmetry and possibly accelerate learning, initializing the weights to zeros does not necessarily mean you will get the same probabilities for all classes.
The reason is that the cross-entropy loss is a function of the weights, the inputs, and the correct class labels. So the gradient will be different for each output 'neuron', depending on the correct class label, and this breaks the symmetry.
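To see this concretely, here is a condensed re-creation of that mnist_softmax.py setup (TF 1.x API, using the old tutorial dataset helper), with both W and b initialised to zeros. Symmetry is still broken because the gradient with respect to each logit is softmax(z) - y, and the one-hot label y differs across classes, so each column of W receives a different update from the very first step:

import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

x = tf.placeholder(tf.float32, [None, 784])
y_ = tf.placeholder(tf.float32, [None, 10])
W = tf.Variable(tf.zeros([784, 10]))   # zero initialisation, as in the question
b = tf.Variable(tf.zeros([10]))
y = tf.matmul(x, W) + b                # logits

cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y))
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(1000):
        batch_xs, batch_ys = mnist.train.next_batch(100)
        sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
    correct = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
    print(sess.run(accuracy, feed_dict={x: mnist.test.images, y_: mnist.test.labels}))
    # typically around 0.92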