I'm running into a situation where giving a neural network extra data reduces accuracy, and I can't see how that's possible.
Suppose you train a neural network - just a binary classifier - on a set of examples that each have, say, 10 variables. It learns to classify both the training and test sets quite accurately. Then rerun with the same examples but an extra 20 variables added to each example. Maybe the extra variables don't give as good a signal as the original ones, but it's still getting the original variables too. Worst case, it should just take a bit more time to learn to ignore the extra variables, right? On the face of it, there shouldn't be any way for the accuracy to be lower.
To go through everything I can think of:
It's the same set of records in each case.
All the original variables are still there, just with some extra ones added.
It's not about overfitting; the network trained with the extra data is much less accurate on both the training and test sets.
I don't think it's about needing more time. It's been running for a long time now and showing no signs of making progress.
I've tried with the learning rate both unchanged and reduced, same result each way.
Using TensorFlow, simple feedforward network with one hidden layer, Adam optimizer. Code is at https://github.com/russellw/tf-examples/blob/master/multilayer.py and the most important section is
# Inputs and outputs
X = tf.placeholder(dtype, shape=(None, cols))
Y = tf.placeholder(dtype, shape=(None,))
# Hidden layers
n1 = 3
w1 = tf.Variable(rnd((cols, n1)), dtype=dtype)
b1 = tf.Variable(rnd(n1), dtype=dtype)
a1 = tf.nn.sigmoid(tf.matmul(X, w1) + b1)
pr('layer 1: {}', n1)
# Output layer
no = 1
wo = tf.Variable(rnd((n1, no)), dtype=dtype)
bo = tf.Variable(rnd(no), dtype=dtype)
p = tf.nn.sigmoid(tf.squeeze(tf.matmul(a1, wo)) + bo)
# Model
cost = tf.reduce_sum((p-Y)**2/rows)
optimizer = tf.train.AdamOptimizer(args.learning_rate).minimize(cost)
tf.global_variables_initializer().run()
How is it possible for the extra data to make the network less accurate? What am I missing?
You are confusing variables with (training) data. Variables are what the network adjusts as it learns from the data - the weights and biases, say. By adding input variables you increase the number of trainable units, which obviously warrants more training time. Past a certain threshold, your data may become insufficient for the network to learn/update those variables.
Extra data simply means more examples.
So in your case it seems that you've crossed that threshold (assuming you let the network train long enough before concluding it was no longer learning).
You may want to read about a phenomenon called the curse of dimensionality. From Wikipedia:
The common theme of these problems is that when the dimensionality increases, the volume of the space increases so fast that the available data become sparse.
https://en.wikipedia.org/wiki/Curse_of_dimensionality
It turns out to have at least something to do with the properties of the optimizer. Though Adam worked best of all the optimizers in one case, in a slightly different case it failed badly where Ftrl solved the problem. I don't know why Adam has this failure mode, but my current solution is to make the optimizer a parameter and use a batch file to loop through all of them.
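As a rough sketch of that idea (the optimizer names and the --optimizer flag below are my additions, not part of the original multilayer.py), the AdamOptimizer line in the snippet above can be replaced with a lookup, and a batch file or shell loop can then invoke the script once per optimizer name:
import argparse
import tensorflow as tf
# Map command-line names to TF1 optimizer classes (hypothetical flag)
OPTIMIZERS = {
    'adam': tf.train.AdamOptimizer,
    'ftrl': tf.train.FtrlOptimizer,
    'sgd': tf.train.GradientDescentOptimizer,
    'rmsprop': tf.train.RMSPropOptimizer,
}
parser = argparse.ArgumentParser()
parser.add_argument('--optimizer', default='adam', choices=sorted(OPTIMIZERS))
parser.add_argument('--learning_rate', type=float, default=0.01)
args = parser.parse_args()
# ... build X, Y, p and cost as in the snippet above, then:
optimizer = OPTIMIZERS[args.optimizer](args.learning_rate).minimize(cost)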
Related
I am training a neural network by SGD (batch size = 1). The inputs are randomly generated, and the labels are calculated based on the inputs; i.e. the data does not have to be realistic, but the relationship between inputs and labels is specific. I will train my NN for only 1 epoch, but with many batches.
I have the following code:
training_input = tf.Variable(tf.zeros(...))
assign_training_input_with_random_values = training_input.assign(tf.random_normal(...))
# Create a session, initialize a bunch of variables, construct a neural network...
for batch in range(batch_number):
    sess.run(assign_training_input_with_random_values)
    # Train my neural network...
However, I noticed that if I write the above code differently, the speed goes down by a lot:
# Run the assignment operation directly without defining it as a variable first
for batch in range(batch_number):
    sess.run(training_input.assign(tf.random_normal(...)))
    # Train my neural network...
The first snippet being significantly faster makes me worry that TensorFlow is only randomizing when I define assign_training_input_with_random_values, and that the same training examples are then fed to the NN in every batch afterwards. In that case, the NN will probably not generalize well. Meanwhile, the second snippet is slow because it randomizes every batch. Is this actually the case, or is there another reason for the difference?
First the explanation to your observations
Computational difference between 1st and 2nd solutions
It makes sense that your first solution is faster than the second. You define the assign operation once and then execute it on every iteration. In the second solution, however, you create a new op on every iteration, growing the computational graph over time, which causes your program to slow down.
Observation about the 1st solution
(Following @Y.Z.'s finding) Apparently the first solution does evaluate to a different random array every time you run it, so the first solution is also valid.
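A minimal check of that finding, using toy shapes (this is my own sketch, not code from the question):
import numpy as np
import tensorflow as tf
training_input = tf.Variable(tf.zeros(shape=[3, 2]))
assign_op = training_input.assign(tf.random_normal(shape=[3, 2]))
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    v1 = sess.run(assign_op)  # first run draws one set of random values
    v2 = sess.run(assign_op)  # second run draws a fresh set
    print(np.array_equal(v1, v2))  # False: the op re-samples on every run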
Another way to implement this
You could also implement your solution with a tf.placeholder, feeding new values in on every iteration, as follows.
import tensorflow as tf
import numpy as np
training_input = tf.Variable(tf.zeros(shape=[3, 2]))
tf_random = tf.placeholder(shape=[3, 2], dtype=tf.float32)
assign_training_input_with_random_values = training_input.assign(tf_random)
# Create a session, initialize a bunch of variables, construct a neural network...
epoch = 0
with tf.Session() as sess:
    while epoch < 10:
        epoch += 1
        sess.run(assign_training_input_with_random_values,
                 feed_dict={tf_random: np.random.normal(size=(3, 2))})
Comparing Solution 1 vs My solution
So it turns out that neither your first solution nor my solution grows the graph. If you run the line
print([n.name for n in tf.get_default_graph().as_graph_def().node])
for your first solution and for my solution (be careful to run tf.reset_default_graph() at the beginning), you'll see that the number of tensors remains constant regardless of the number of iterations. It appears that TensorFlow is smart enough to prune old tf.random tensors that are no longer used.
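For instance, a quick way to see this for the first solution (toy shapes, my own sketch):
import tensorflow as tf
tf.reset_default_graph()
training_input = tf.Variable(tf.zeros(shape=[3, 2]))
assign_op = training_input.assign(tf.random_normal(shape=[3, 2]))
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(5):
        sess.run(assign_op)
        # The node count stays the same on every iteration
        print(len(tf.get_default_graph().as_graph_def().node))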
So far I have trained a couple of different models in TensorFlow (with Keras) and I see that getting batch_size right seems to be important not just for the speed of training but also for the resulting accuracy of the model.
What confuses me is the case where a model has an actual batch dimension as the first dimension of its input (and output as well). If my batch size is 32 but I'm always feeding in 1 example at run-time, where does the batch dimension come in? How could I be utilizing the vast majority of it if I'm inherently only using 1/batch_size of it in the forward pass?
If you are curious, the model I am researching is this one:
https://github.com/pierluigiferrari/ssd_keras/blob/master/models/keras_ssd300.py
see:
Output shape of predictions: (batch, n_boxes_total, n_classes + 4 + 8)
predictions = Concatenate(axis=2, name='predictions')([mbox_conf_softmax, mbox_loc, mbox_priorbox])
The tensors have run through numerous other layers whose constants and so on were pretrained with [batch_size] as well. To me it just seems like inputs at different batch indices would have to yield different results. Maybe I just need something incredibly obvious pointed out to me.
It would seem that after training you must rebuild the model with a batch size of 1 and then transfer the weights from the training model to the new model for evaluation. The alternative is performing batch_size predictions at once (which of course is not always feasible per application). If there are alternatives (or if I read this wrong), please feel free to add an answer.
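A minimal, self-contained sketch of that weight-transfer approach (a toy model stands in for keras_ssd300, and build_model is a hypothetical helper):
import numpy as np
from tensorflow import keras
def build_model(batch_size):
    # Fix the batch dimension explicitly, like models that bake batch_size into their shapes
    inputs = keras.layers.Input(batch_shape=(batch_size, 10))
    hidden = keras.layers.Dense(16, activation='relu')(inputs)
    outputs = keras.layers.Dense(1, activation='sigmoid')(hidden)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer='adam', loss='binary_crossentropy')
    return model
# Train with a fixed batch size of 32
train_model = build_model(batch_size=32)
X = np.random.rand(320, 10).astype(np.float32)
y = (X.sum(axis=1) > 5).astype(np.float32)
train_model.fit(X, y, batch_size=32, epochs=1, verbose=0)
# Rebuild the same architecture with batch size 1 and copy the trained weights over
infer_model = build_model(batch_size=1)
infer_model.set_weights(train_model.get_weights())
print(infer_model.predict(X[:1]))  # single-example inference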
I'm reading the google ML crash course and have one question.
What is a weight? (I understand that it is a slope in a plot, but that doesn't fit into my understanding.)
I also don't understand the impact of a weight on the model's prediction (for example, in this playground).
Many thanks for the help.
Every layer in a model is a huge mathematical function with many "unknown" variables.
When you build a model, you build a monster function (with thousands or millions of unknown variables) that gives an output from an input.
Something like this:
output_tensor = huge_function(your_input_tensor,var1,var2,var3,var4.......,var10000000)
These variables are the weights. At the beginning, they receive random values, and obviously your function gives you terrible results.
As you train, you adjust the values of these variables so that your results improve.
Weights are those variables: the ones in the model that you adjust so that your huge function produces good results.
Weights vs. Biases
Depending on what you are reading, or which framework you are using, they may all be called weights. By the description above, both fit.
But usually:
Weights - Multiply the inputs
Biases - Are added to the multiplied outputs
So the usual layers (with some important differences, of course) perform operations like:
output_matrix = input_matrix x weights + biases
Nothing prevents you from creating custom operations, though, where your variables/weights neither multiply nor add.
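A tiny numeric sketch of that usual operation, with made-up numbers:
import numpy as np
x = np.array([[1.0, 2.0]])          # one example with two input features
weights = np.array([[0.5], [0.25]]) # one weight per input; adjusted during training
biases = np.array([0.1])            # added after the multiplication
output = x @ weights + biases       # 1*0.5 + 2*0.25 + 0.1 = 1.1
print(output)                       # [[1.1]]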
Is it possible to fit or approximate multidimensional functions with neural networks?
Let's say I want to model the function f(x,y) = sin(x) + y from some given measurement data. (f(x,y) is considered the ground truth and is not known.) Also, if possible, some code examples written in TensorFlow or Keras would be great.
As said by @AndreHolzner, theoretically you can approximate any continuous function with a neural network as accurately as you want, on any compact subset of R^n, even with only one hidden layer.
However, in practice, the neural net may have to be very large for some functions, and can sometimes be untrainable (the optimal weights may be hard to find without getting stuck in a local minimum). So here are a few practical suggestions (unfortunately vague, because the details depend too much on your data and are hard to predict without multiple tries):
Keep the network not too big (that's hard to quantify, unfortunately): otherwise you'll just overfit. You'll probably need a LOT of training samples.
A big number of reasonably-sized layers is usually better than a reasonable number of big layers.
If you have some priors about the function, use them: for instance, if you believe there is some kind of periodicity in f (as in your example, but it could be more complicated), you could apply the sin() function to some of the outputs of the first layer (not all of them, which would give you a truly periodic output). If you suspect a polynomial of degree n, just augment your input x with x², ..., x^n and use a linear regression on that input (see the sketch after this list); it will be much easier than learning the weights.
The universal approximation theorem holds on any compact subset of R^n, not on the entire multidimensional space. In particular, you'll never be able to predict the value for an input that's much bigger than any of the training samples (say you trained on numbers from 0 to 100; don't test on 200, it will fail).
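As a rough sketch of the polynomial-augmentation idea mentioned above (the degree-3 target and all the constants here are made up for illustration):
import numpy as np
import tensorflow as tf
# Made-up ground truth: a degree-3 polynomial plus a little noise
x = np.random.uniform(-2, 2, size=(1000, 1)).astype(np.float32)
y = (0.5 * x**3 - x + 2 + np.random.normal(scale=0.05, size=x.shape)).astype(np.float32)
# Augment the input with x^2 and x^3, then fit a plain linear model on it
x_aug = np.concatenate([x, x**2, x**3], axis=1)
x_ph = tf.placeholder(shape=[None, 3], dtype=tf.float32)
y_ph = tf.placeholder(shape=[None, 1], dtype=tf.float32)
pred = tf.layers.dense(inputs=x_ph, units=1, activation=None)  # linear regression on augmented input
cost = tf.reduce_mean((pred - y_ph) ** 2)
train_op = tf.train.AdamOptimizer(0.01).minimize(cost)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(2000):
        sess.run(train_op, feed_dict={x_ph: x_aug, y_ph: y})
    print(sess.run(cost, feed_dict={x_ph: x_aug, y_ph: y}))  # ends up close to the noise level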
For an example of regression you can look here for instance. To regress a more complicated function you'd need to stack more complicated functions to get pred from x, for instance like this:
n_layers = 3
x = tf.placeholder(shape=[None, n_dimensions], dtype=tf.float32)
last_layer = x
# Add n_layers dense hidden layers
for i in range(n_layers):
    last_layer = tf.layers.dense(inputs=last_layer, units=128, activation=tf.nn.relu)
# Get the output prediction
pred = tf.layers.dense(inputs=last_layer, units=1, activation=None)
# Get the cost, training op, etc., just like in the linear regression example
I am building a simple linear regressor for data from a CSV file. The data includes the weight and height values of some people. The overall learning process is very simple:
MAX_STEPS = 2000
# ...
features = [tf.contrib.layers.real_valued_column(feature_name) for feature_name in FEATURES_COL]
# ...
linear_regressor = tf.contrib.learn.LinearRegressor(feature_columns=features)
linear_regressor.fit(input_fn=prepare_input, max_steps=MAX_STEPS)
However, the model built by the regressor is, unexpectedly, bad. The result can be illustrated with the following picture:
Visualization code (just in case):
plt.plot(height_and_weight_df_filtered[WEIGHT_COL],
linear_regressor.predict(input_fn=prepare_full_input),
color='blue',
linewidth=3)
Here is the same data given to the LinearRegression class from scikit-learn:
lr_updated = linear_model.LinearRegression()
lr_updated.fit(weight_filtered_reshaped, height_filtered)
And the visualization:
Increasing the number of steps has no effect. I would assume I'm using the TensorFlow regressor in the wrong way.
iPython notebook with the code.
It looks like your TF model does indeed work and will get there with enough steps. You need to jack it right up though - 200K steps showed significant improvement, almost as good as the sklearn default.
I think there are two issues:
sklearn looks like it simply solves the equation using ordinary least squares. TF's LinearRegressor uses the FtrlOptimizer. The paper indicates it is a better choice for very large datasets.
The input_fn for the model injects the whole training set at once, for every step. This is just a hunch, but I suspect that the FtrlOptimizer may do better if it sees batches at a time.
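For instance, a batched input_fn might look roughly like this (HEIGHT_COL and the batch size of 32 are my assumptions, not names from the notebook):
import tensorflow as tf
def prepare_batched_input():
    weights = tf.constant(height_and_weight_df_filtered[WEIGHT_COL].values, dtype=tf.float32)
    heights = tf.constant(height_and_weight_df_filtered[HEIGHT_COL].values, dtype=tf.float32)
    # Slice the full arrays into single examples, then re-batch them
    weight, height = tf.train.slice_input_producer([weights, heights], shuffle=True)
    weight_batch, height_batch = tf.train.batch([weight, height], batch_size=32)
    return {FEATURES_COL[0]: weight_batch}, height_batch
linear_regressor.fit(input_fn=prepare_batched_input, max_steps=MAX_STEPS)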
Instead of just changing the number of steps up a couple orders of magnitude, you can also jack the learning rate up on the optimizer (the default is 0.2) and get similarly good results from only 4k steps:
linear_regressor = tf.contrib.learn.LinearRegressor(
feature_columns=features,
optimizer=tf.train.FtrlOptimizer(learning_rate=5.0))
I met a similar problem. The solution is to check whether your input_fn runs for enough epochs. Training may not converge before iterating over the whole training set several times.
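A rough sketch of that check, using numpy_input_fn with an explicit num_epochs (the column names and the numbers here are my assumptions; adapt them to your own data):
import numpy as np
import tensorflow as tf
x = {FEATURES_COL[0]: np.asarray(height_and_weight_df_filtered[WEIGHT_COL], dtype=np.float32)}
y = np.asarray(height_and_weight_df_filtered[HEIGHT_COL], dtype=np.float32)
# The returned input_fn feeds shuffled batches until 100 passes over the data are done
input_fn_with_epochs = tf.estimator.inputs.numpy_input_fn(
    x=x, y=y, batch_size=32, num_epochs=100, shuffle=True)
linear_regressor.fit(input_fn=input_fn_with_epochs, max_steps=MAX_STEPS)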