Reason why setting tensorflow's variable with small stddev - variables

I have a question about a reason why setting TensorFlow's variable with small stddev.
I guess many people do test MNIST test code from TensorFlow beginner's guide.
As following it, the first layer's weights are initiated by using truncated_normal with stddev 0.1.
And I guessed if setting it with more bigger value, then it would be the same result, which is exactly accurate.
But although increasing epoch count, it doesn't work.
Is there anybody know this reason?
original :
W_layer = tf.Variable(tf.truncated_normal([inp.get_shape()[1].value, size],stddev=0.1), name='w_'+name)
#result : (990, 0.93000001, 0.89719999)
modified :
W_layer = tf.Variable(tf.truncated_normal([inp.get_shape()[1].value, size],stddev=200), name='w_'+name)
#result : (99990, 0.1, 0.098000005)

The reason is because you want to keep all the layer's variances (or standard deviations) approximately the same, and sane. It has to do with the error backpropagation step of the learning process and the activation functions used.
In order to learn the network's weights, the backpropagation step requires knowledge of the network's gradient, a measure of how strong each weight influences the input to reach the final output; layer's weight variance directly influences the propagation of gradients.
Say, for example, that the activation function is sigmoidal (e.g. tf.nn.sigmoid or tf.nn.tanh); this implies that all input values are squashed into a fixed output value range. For the sigmoid, it is the range 0..1, where essentially all values z greater or smaller than +/- 4 are very close to one (for z > 4) or zero (for z < -4) and only values within that range tend to have some meaningful "change".
Now the difference between the values sigmoid(5) and sigmoid(1000) is barely noticeable. Because of that, all very large or very small values will optimize very slowly, since their influence on the result y = sigmoid(W*x+b) is extremely small. Now the pre-activation value z = W*x+b (where x is the input) depends on the actual input x and the current weights W. If either of them is large, e.g. by initializing the weights with a high variance (i.e. standard deviation), the result will necessarily be (relatively) large, leading to said problem. This is also the reason why truncated_normal is used rather than a correct normal distribution: The latter only guarantees that most of the values are very close to the mean, with some less than 5% chance that this is not the case, while truncated_normal simply clips away every value that is too big or too small, guaranteeing that all weights are in the same range, while still being normally distributed.
To make matters worse, in a typical neural network - especially in deep learning - each network layer is followed by one or many others. If in each layer the output value range is big, the gradients will get bigger and bigger as well; this is known as the exploding gradients problem (a variation of the vanishing gradients, where gradients are getting smaller).
The reason that this is a problem is because learning starts at the very last layer and each weight is adjusted depending on how much it contributed to the error. If the gradients are indeed getting very big towards the end, the very last layer is the first one to pay a high toll for this: Its weights get adjusted very strongly - likely overcorrecting the actual problem - and then only the "remaining" error gets propagated further back, or up, the network. Here, since the last layer was already "fixed a lot" regarding the measured error, only smaller adjustments will be made. This may lead to the problem that the first layers are corrected only by a tiny bit or not at all, effectively preventing all learning there. The same basically happens if the learning rate is too big.
Finding the best weight initialization is a topic by itself and there are somewhat more sophisticated methods such as Xavier initialization or Layer-sequential unit variance, however small normally distributed values are usually simply a good guess.

Related

What’s the advantage of using LSTM for time series predict as opposed to Regression?

In neural networks, in general, which model should yield a better and accurate output between both for time series?
As you rightly mentioned, We can use linear regression with time series data as long as:
The inclusion of lagged terms as regressors does not create a collinearity problem.
Both the regressors and the explained variable are stationary.
Your errors are not correlated with each other.
The other linear regression assumptions apply.
No autocorrelation is the single most important assumption in linear regression. If autocorrelation is present the consequences are the following:
Bias: Your “best fit line” will likely be way off because it will be pulled away from the “true line” by the effect of the lagged errors.
Inconsistency: Given the above, your sample estimators are unlikely to converge to the population parameters.
Inefficiency: While it is theoretically possible, your residuals are unlikely to be homoskedastic if they are autocorrelated. Thus, your confidence intervals and your hypothesis tests will be unreliable.
While, The Long Short Term Memory neural network is a type of a Recurrent Neural Network (RNN). RNNs use previous time events to inform the later ones. For example, to classify what kind of event is happening in a movie, the model needs to use information about previous events. RNNs work well if the problem requires only recent information to perform the present task. If the problem requires long term dependencies, RNN would struggle to model it. The LSTM was designed to learn long term dependencies. It remembers the information for long periods.
To focus on the 1st sequence. The model takes the feature of the time bar at index 0 and it tries to predict the target of the time bar at index 1. Then it takes the feature of the time bar at index 1 and it tries to predict the target of the time bar at index 2, etc. The feature of 2nd sequence is shifted by 1 time bar from the feature of 1st sequence, the feature of 3rd sequence is shifted by 1 time bar from 2nd sequence, etc. With this procedure, we get many shorter sequences that are shifted by a single time bar.

Isn't it dangerous to apply Min Max Scaling to the test set?

Here's the situation I am worrying about.
Let me say I have a model trained with min-max scaled data. I want to test my model, so I also scaled the test dataset with my old scaler which was used in the training stage. However, my new test data's turned out to be the newer minimum, so the scaler returned negative value.
As far as I know, minimum and maximum aren't that stable value, especially in the volatile dataset such as cryptocurrency data. In this case, should I update my scaler? Or should I retrain my model?
I happen to disagree with #Sharan_Sundar. The point of scaling is to bring all of your features onto a single scale, not to rigorously ensure that they lie in the interval [0,1]. This can be very important, especially when considering regularization techniques the penalize large coefficients (whether they be linear regression coefficients or neural network weights). The combination of feature scaling and regularization help to ensure your model generalizes to unobserved data.
Scaling based on your "test" data is not a great idea because in practice, as you pointed out, you can easily observe new data points that don't lie within the bounds of your original observations. Your model needs to be robust to this.
In general, I would recommend considering different scaling routines. scikitlearn's MinMaxScaler is one, as is StandardScaler (subtract mean and divide by standard deviation). In the case where your target variable, cryptocurrency price can vary over multiple orders of magnitude, it might be worth using the logarithm function for scaling some of your variables. This is where data science becomes an art -- there's not necessarily a 'right' answer here.
(EDIT) - Also see: Do you apply min max scaling separately on training and test data?
Ideally you should scale first and then only split into test and train. But its not preferable to use minmax scaler with data which can have dynamically varying min and max values with significant variance in realtime scenario.

Machine learning: why the cost function does not need to be derivable?

I was playing around with Tensorflow creating a customized loss function and this question about general machine learning arose to my head.
My understanding is that the optimization algorithm needs a derivable cost function to find/approach a minimum, however we can use functions that are non-derivable such as the absolute function (there is no derivative when x=0). A more extreme example, I defined my cost function like this:
def customLossFun(x,y):
return tf.sign(x)
and I expected an error when running the code, but it actually worked (it didn't learn anything but it didn't crash).
Am I missing something?
You're missing the fact that the gradient of the sign function is somewhere manually defined in the Tensorflow source code.
As you can see here:
def _SignGrad(op, _):
"""Returns 0."""
x = op.inputs[0]
return array_ops.zeros(array_ops.shape(x), dtype=x.dtype)
the gradient of tf.sign is defined to be always zero. This, of course, is the gradient where the derivate exists, hence everywhere but not in zero.
The tensorflow authors decided to do not check if the input is zero and throw an exception in that specific case
In order to prevent TensorFlow from throwing an error, the only real requirement is that you cost function evaluates to a number for any value of your input variables. From a purely "will it run" perspective, it doesn't know/care about the form of the function its trying to minimize.
In order for your cost function to provide you a meaningful result when TensorFlow uses it to train a model, it additionally needs to 1) get smaller as your model does better and 2) be bounded from below (i.e. it can't go to negative infinity). It's not generally necessary for it to be smooth (e.g. abs(x) has a kink where the sign flips). Tensorflow is always able to compute gradients at any location using automatic differentiation (https://en.wikipedia.org/wiki/Automatic_differentiation, https://www.tensorflow.org/versions/r0.12/api_docs/python/train/gradient_computation).
Of course, those gradients are of more use if you've chose a meaningful cost function isn't isn't too flat.
Ideally, the cost function needs to be smooth everywhere to apply gradient based optimization methods (SGD, Momentum, Adam, etc). But nothing's going to crash if it's not, you can just have issues with convergence to a local minimum.
When the function is non-differentiable at a certain point x, it's possible to get large oscillations if the neural network converges to this x. E.g., if the loss function is tf.abs(x), it's possible that the network weights are mostly positive, so the inference x > 0 at all times, so the network won't notice tf.abs. However, it's more likely that x will bounce around 0, so that the gradient is arbitrarily positive and negative. If the learning rate is not decaying, the optimization won't converge to the local minimum, but will bound around it.
In your particular case, the gradient is zero all the time, so nothing's going to change at all.
If it didn't learn anything, what have you gained ? Your loss function is differentiable almost everywhere but it is flat almost anywhere so the minimizer can't figure out the direction towards the minimum.
If you start out with a positive value, it will most likely be stuck at a random value on the positive side even though the minima on the left side are better (have a lower value).
Tensorflow can be used to do calculations in general and it provides a mechanism to automatically find the derivative of a given expression and can do so across different compute platforms (CPU, GPU) and distributed over multiple GPUs and servers if needed.
But what you implement in Tensorflow does not necessarily have to be a goal function to be minimized. You could use it e.g. to throw random numbers and perform Monte Carlo integration of a given function.

Inference on several inputs in order to calculate the loss function

I am modeling a perceptual process in tensorflow. In the setup I am interested in, the modeled agent is playing a resource game: it has to choose 1 out of n resouces, by relying only on the label that a classifier gives to the resource. Each resource is an ordered pair of two reals. The classifier only sees the first real, but payoffs depend on the second. There is a function taking first to second.
Anyway, ideally I'd like to train the classifier in the following way:
In each run, the classifier give labels to n resources.
The agent then gets the payoff of the resource corresponding to the highest label in some predetermined ranking (say, A > B > C > D), and randomly in case of draw.
The loss is taken to be the normalized absolute difference between the payoff thus obtained and the maximum payoff in the set of resources. I.e., (Payoff_max - Payoff) / Payoff_max
For this to work, one needs to run inference n times, once for each resource, before calculating the loss. Is there a way to do this in tensorflow? If I am tackling the problem in the wrong way feel free to say so, too.
I don't have much knowledge in ML aspects of this, but from programming point of view, I can see doing it in two ways. One is by copying your model n times. All the copies can share the same variables. The output of all of these copies would go into some function that determines the the highest label. As long as this function is differentiable, variables are shared, and n is not too large, it should work. You would need to feed all n inputs together. Note that, backprop will run through each copy and update your weights n times. This is generally not a problem, but if it is, I heart about some fancy tricks one can do by using partial_run.
Another way is to use tf.while_loop. It is pretty clever - it stores activations from each run of the loop and can do backprop through them. The only tricky part should be to accumulate the inference results before feeding them to your loss. Take a look at TensorArray for this. This question can be helpful: Using TensorArrays in the context of a while_loop to accumulate values

Tensorflow: opt.compute_gradients() returns values different from the weight difference of opt.apply_gradients()

Question: What is the most efficient way to get the delta of my weights in the most efficient way in a TensorFlow network?
Background: I've got the operators hooked up as follows (thanks to this SO question):
self.cost = `the rest of the network`
self.rmsprop = tf.train.RMSPropOptimizer(lr,rms_decay,0.0,rms_eps)
self.comp_grads = self.rmsprop.compute_gradients(self.cost)
self.grad_placeholder = [(tf.placeholder("float", shape=grad[1].get_shape(), name="grad_placeholder"), grad[1]) for grad in self.comp_grads]
self.apply_grads = self.rmsprop.apply_gradients(self.grad_placeholder)
Now, to feed in information, I run the following:
feed_dict = `training variables`
grad_vals = self.sess.run([grad[0] for grad in self.comp_grads], feed_dict=feed_dict)
feed_dict2 = `feed_dict plus gradient values added to self.grad_placeholder`
self.sess.run(self.apply_grads, feed_dict=feed_dict2)
The command of run(self.apply_grads) will update the network weights, but when I compute the differences in the starting and ending weights (run(self.w1)), those numbers are different than what is stored in grad_vals[0]. I figure this is because the RMSPropOptimizer does more to the raw gradients, but I'm not sure what, or where to find out what it does.
So back to the question: How do I get the delta on my weights in the most efficient way? Am I stuck running self.w1.eval(sess) multiple times to get the weights and calc the difference? Is there something that I'm missing with the tf.RMSPropOptimizer function.
Thanks!
RMSprop does not subtract the gradient from the parameters but use more complicated formula involving a combination of:
a momentum, if the corresponding parameter is not 0
a gradient step, rescaled non uniformly (on each coordinate) by the square root of the squared average of the gradient.
For more information you can refer to these slides or this recent paper.
The delta is first computed in memory by tensorflow in the slot variable 'momentum' and then the variable is updated (see the C++ operator).
Thus, you should be able to access it and construct a delta node with delta_w1 = self.rmsprop.get_slot(self.w1, 'momentum'). (I have not tried it yet.)
You can add the weights to the list of things to fetch each run call. Then you can compute the deltas outside of TensorFlow since you will have the iterates. This should be reasonably efficient, although it might incur an extra elementwise difference, but to avoid that you might have to hack around in the guts of the optimizer and find where it puts the update before it applies it and fetch that each step. Fetching the weights each call shouldn't do wasteful extra evaluations of part of the graph at least.
RMSProp does complicated scaling of the learning rate for each weight. Basically it divides the learning rate for a weight by a running average of the magnitudes of recent gradients of that weight.