Questions on defining and calling tf.train.MomentumOptimizer - TensorFlow

I have some questions regarding the following code segment
def _optimizer(self, training_iters, global_step, opt_kwargs={}):
    learning_rate = self.opt_kwargs.pop("learning_rate", 0.2)
    decay_rate = self.opt_kwargs.pop("decay_rate", 0.95)
    self.learning_rate_node = tf.train.exponential_decay(learning_rate=learning_rate,
                                                         global_step=global_step,
                                                         decay_steps=training_iters,
                                                         decay_rate=decay_rate,
                                                         staircase=True)
    optimizer = tf.train.MomentumOptimizer(learning_rate=self.learning_rate_node,
                                           **self.opt_kwargs).minimize(self.net.cost,
                                                                       global_step=global_step)
The input parameter opt_kwargs is set up as opt_kwargs=dict(momentum=0.2).
Why do we need self.opt_kwargs.pop("learning_rate", 0.2) to assign learning_rate? My guess is that this injects the learning rate and decay rate into the opt_kwargs dict, but I don't see where they are actually used.
Secondly, in tf.train.MomentumOptimizer(learning_rate=self.learning_rate_node, **self.opt_kwargs), it looks like **self.opt_kwargs passes the whole opt_kwargs dict into MomentumOptimizer. However, according to tf.train.MomentumOptimizer.__init__(learning_rate, momentum, use_locking=False, name='Momentum', use_nesterov=False), it only needs the momentum value. Here we are also passing the learning_rate and decay_rate included in self.opt_kwargs. Is this correct?

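1.) pop extracts the learning_rate and decay_rate values (falling back to the given defaults when the keys are absent) so they can be fed to exponential_decay(), which accepts them as individual arguments. pop also removes those keys from the dict, so they are no longer in self.opt_kwargs when it is unpacked. 2.) It's not the cleanest design, but keeping every option in one dict is fine. It makes the code flexible: for example, you can easily swap MomentumOptimizer for another optimizer that takes different keyword arguments. Just note that whatever is still in the dict must be accepted by the optimizer's constructor; an unexpected keyword raises a TypeError.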

tf.train.MomentumOptimizer.__init__(learning_rate, momentum, use_locking=False, name='Momentum', use_nesterov=False) means you need to explicitly pass a momentum value to the constructor. With self.opt_kwargs.pop, you do not need to supply "learning_rate" or "decay_rate", since they default to 0.2 and 0.95.
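The behaviour of pop and ** unpacking described in these answers can be checked without TensorFlow. The following is a minimal sketch; make_optimizer is a hypothetical stand-in with a fixed signature, not the real MomentumOptimizer constructor.

```python
def make_optimizer(learning_rate, momentum, use_nesterov=False):
    """Hypothetical stand-in for a constructor with a fixed keyword signature."""
    return {"learning_rate": learning_rate, "momentum": momentum,
            "use_nesterov": use_nesterov}

opt_kwargs = dict(momentum=0.2)

# pop() returns the stored value if the key exists, otherwise the default,
# and removes the key from the dict if it was present.
learning_rate = opt_kwargs.pop("learning_rate", 0.2)
decay_rate = opt_kwargs.pop("decay_rate", 0.95)

# Only the entries still in the dict are unpacked into keyword arguments.
optimizer = make_optimizer(learning_rate=learning_rate, **opt_kwargs)

# A key the constructor does not know about is rejected with a TypeError.
try:
    make_optimizer(learning_rate=0.1, **{"momentum": 0.2, "decay_rate": 0.95})
    rejected = False
except TypeError:
    rejected = True
```

After the two pop calls, opt_kwargs holds only momentum, which is why the unpacked call succeeds.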

Is it relevant to use both feature normalizer_fn and batch normalization?

Is it relevant to use both feature normalizer_fn and batch normalization, as in the following?
feature_columns_complex_standardized = [
    tf.feature_column.numeric_column("my_feature", normalizer_fn=lambda x: (x - xMean) / xStd)
]
model1 = tf.estimator.DNNClassifier(feature_columns=feature_columns_complex_standardized,
                                    hidden_units=[512, 512, 512],
                                    optimizer=tf.train.AdamOptimizer(learning_rate=0.001, beta1=0.9, beta2=0.99, epsilon=1e-08, use_locking=False),
                                    weight_column=weights,
                                    dropout=0.5,
                                    activation_fn=tf.nn.softmax,
                                    n_classes=10,
                                    label_vocabulary=Action_vocab,
                                    model_dir='./Models9/Action/',
                                    loss_reduction=tf.losses.Reduction.SUM_OVER_BATCH_SIZE,
                                    config=tf.estimator.RunConfig().replace(save_summary_steps=10),
                                    batch_norm=True)
You may be conflating the two. Feature normalization brings the features of a dataset onto the same scale, whereas batch normalization addresses internal covariate shift, where the input distribution of each hidden unit changes every time the parameters of the previous layer are updated.
So you can use both at the same time.
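To make the distinction concrete, here is a small sketch in plain Python. It assumes a single feature and a made-up linear "layer", and it leaves out the learned scale and shift that real batch normalization layers also apply. Input standardization uses fixed statistics from the training data, while batch normalization recomputes statistics from whatever activations are in the current batch.

```python
import statistics

def standardize(values, mean, std):
    """Input normalization: uses fixed statistics computed once from the dataset."""
    return [(v - mean) / std for v in values]

def batch_norm(batch, eps=1e-5):
    """Batch normalization core step: uses the statistics of the current batch."""
    mu = statistics.mean(batch)
    var = statistics.pvariance(batch)
    return [(v - mu) / (var + eps) ** 0.5 for v in batch]

# Hypothetical training feature and its dataset-level statistics.
train_feature = [10.0, 12.0, 14.0, 16.0, 18.0]
x_mean = statistics.mean(train_feature)
x_std = statistics.pstdev(train_feature)
inputs = standardize(train_feature, x_mean, x_std)

# Hypothetical hidden layer: its output scale drifts as the weights change,
# which is what batch normalization corrects inside the network.
hidden = [3.0 * v + 5.0 for v in inputs[:3]]
normalized_hidden = batch_norm(hidden)
```

The two steps act at different places (the raw inputs versus hidden activations), which is why they do not conflict.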

Keras custom loss function with binary (round) with tensorflow backend

I'm currently trying to implement a custom loss function (precision) with a binary outcome, but the TensorFlow backend refuses to use the round function, which is needed to generate a '0' or '1'.
As far as I can tell, this is because TensorFlow defines the gradient of round as None, and the loss function can't return None.
For now I have implemented this custom loss, which gets as close as possible to '0' or '1', in the R Keras interface.
precision_loss <- function(y_true, y_pred){
  y_pred_pos = K$clip(y_pred, 0, 1)
  #Custom sigmoid to generate '0' '1'
  y_pred_pos = K$maximum(0,K$minimum(1,(y_pred_pos+0.0625)/0.125))
  y_pred_neg = 1 - y_pred_pos
  y_pos = K$clip(y_true, 0, 1)
  #Custom sigmoid to generate '0' '1'
  y_pos = K$maximum(0,K$minimum(1,(y_pos+0.0625)/0.125))
  y_neg = 1 - y_pos
  #Generate confusion matrix counts
  tp = K$sum(y_pos*y_pred_pos)
  tn = K$sum(y_neg*y_pred_neg)
  fp = K$sum(y_neg*y_pred_pos)
  fn = K$sum(y_pos*y_pred_neg)
  return(1-(tp/(tp+fp+K$epsilon())))
}
Notice the "sigmoid": K$maximum(0,K$minimum(1,(y_pos+0.0625)/0.125))
The version above is a workaround for what I originally wanted to implement:
precision_loss <- function(y_true, y_pred){
  y_pred_pos = K$round(K$clip(y_pred, 0, 1))
  y_pred_neg = 1 - y_pred_pos
  y_pos = K$round(K$clip(y_true, 0, 1))
  y_neg = 1 - y_pos
  #Generate confusion matrix counts
  tp = K$sum(K$clip(y_pos * y_pred_pos,0,1))
  tn = K$sum(K$clip(y_neg * y_pred_neg,0,1))
  fp = K$sum(K$clip(y_neg * y_pred_pos,0,1))
  fn = K$sum(K$clip(y_pos * y_pred_neg,0,1))
  return(1-(tp/(tp+fp+K$epsilon())))
}
Does anyone have an alternative implementation that generates binary outcomes in the loss function without using round?
P.S.: round is allowed in custom metric functions.
In order to build a binary loss function, it wouldn't be enough to just build the custom loss function itself. You would also have to pre-define the gradients.
Your high-dimensional loss function would be zero for some points and one for all others. For all non-continuous points in this space, it would be impossible to analytically compute a gradient (i.e. the concept of a gradient doesn't even exist for such points), so you would have to just define one. And for all the continuous points in this space (e.g. an open set in which all loss values are 1), the gradient would exist, but it would be zero, so you would also have to pre-define the gradient values, otherwise your weights wouldn't move at all.
That means either way you would have to define your own custom "gradient" computation function that replaces Keras' (i.e. TensorFlow's) automatic differentiation engine for that particular node in the graph (the loss function node).
You could certainly achieve this by modifying your local copy of Keras or TensorFlow, but nothing good can come from it.
Also, even if you managed to do this, consider this: If your loss function returns only 0 or 1, that means it can only distinguish between two states: The model's prediction is either 100% correct (0 loss) or it is not 100% correct (1 loss). The magnitude of the gradient would have to be the same for all non-100% cases. Is that a desirable property?
Your quasi-binary sigmoid solution has the same problem: The gradient will be almost zero almost everywhere, and in the few points where it won't be almost zero, it will be almost infinity. If you try to train a model with that loss function, it won't learn anything.
As you noticed, a custom loss function needs to be built from functions whose gradients are defined (so that the loss can be minimised), which is not necessary for a simple metric. Functions like “round” and “sign” are difficult to use in a loss function because their gradients are either zero everywhere or infinite, which is not helpful for minimisation. That’s probably why their gradients are not defined by default.
You then have two options:
Option 1: use the round function, but add your own custom gradient for round and substitute it in the backend.
Option 2: define another loss function that does not use round.
You chose option 2, which I think is the best option. But your “sigmoid” is essentially a clipped linear ramp, so it is probably not a good approximation of your “round” function. You could use an actual sigmoid, which is slower because of the exponential, but you can obtain a similar result with a modified softsign:
max_gradient=100
K$maximum(0,K$minimum(1,0.5*(1+(max_gradient*y_pos)/(1+ max_gradient*abs(y_pos)))))
The max_gradient coefficient can be used to make the edge sharper. It controls the maximum gradient, which occurs where the output crosses 0.5; the larger the coefficient, the steeper the slope there.
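The effect of the softsign-based step can be explored in plain Python before putting it into a loss. This is only a sketch under one assumption that is not in the expression above: the input is shifted by 0.5 so that the edge sits at a prediction of 0.5, mimicking round. With this form the slope at the centre works out to max_gradient / 2, and it stays small but non-zero away from the edge, unlike a hard round.

```python
def hard_round(y):
    """Reference step function: zero slope everywhere except at the jump."""
    return 1.0 if y >= 0.5 else 0.0

def soft_round(y, max_gradient=100.0):
    """Softsign-based smooth step, centred at 0.5 (the shift is an assumption)."""
    z = max_gradient * (y - 0.5)
    value = 0.5 * (1.0 + z / (1.0 + abs(z)))
    return max(0.0, min(1.0, value))

def numerical_slope(f, y, h=1e-6):
    """Central finite difference, used here only to inspect the slope."""
    return (f(y + h) - f(y - h)) / (2.0 * h)

slope_at_edge = numerical_slope(soft_round, 0.5)     # close to max_gradient / 2
slope_away_soft = numerical_slope(soft_round, 0.9)   # small but non-zero
slope_away_hard = numerical_slope(hard_round, 0.9)   # exactly zero
```

A non-zero slope away from the edge is what lets gradient descent keep moving the weights; how well that works for a precision-style loss would still need to be tested on the actual model.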

Cache intermediate tensor and update periodically

I have a large tensor that is expensive to calculate, but realistically I only need to recalculate it every 10 iterations or so (during gradient descent). What's the best way to do this?
More specifically:
Suppose I have an intermediate_tensor that is used in the calculation of final_tensor each time a tf.Session is run. final_tensor is, in my case, a set of modified gradients to use in optimization. It is possible to define a graph that contains both intermediate_tensor and final_tensor. However, running this graph will be inefficient when intermediate_tensor changes slowly. In pseudocode, this is what I'd like to do:
intermediate_tensor = tf.some_operation(earlier_variable)
final_tensor = tf.matmul(intermediate_tensor, other_earlier_variable)
with tf.Session() as sess:
    # pretending `partial_run` works like I want it to:
    sess.partial_run(intermediate_tensor, feed_dict = {})
    for i in range(5):
        ft = sess.partial_run(final_tensor, feed_dict = {})
        print(ft)
The experimental partial_run feature is almost what I'm looking for. However, partial_run can only be used if I want to evaluate final_tensor just once for each evaluation of intermediate_tensor. It won't work in a for loop.
My workaround for the moment is to use tf.placeholder. I evaluate intermediate_tensor in one call to sess.run, then feed the result to a new call of sess.run through a placeholder. However, this is very inflexible. It requires that I hardcode the variable shape at compile time, for example. It's also not very good when the number of intermediate variables I'd like to use is very large.
Is there a better way? This would be very helpful if, say, one were using a curvature matrix that doesn't need to be evaluated every iteration.
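The control flow of the workaround described above (compute the expensive value occasionally, reuse it in between) can be sketched without TensorFlow. The two functions below are hypothetical stand-ins for the two sess.run calls, and the refresh interval of 10 is taken from the question; this only illustrates the pattern, not a TensorFlow-native solution.

```python
def expensive_intermediate(variable):
    # Hypothetical stand-in for the costly tensor computation.
    return [v * v for v in variable]

def final_value(intermediate, other):
    # Hypothetical stand-in for the cheap step that consumes the cached result.
    return sum(i * o for i, o in zip(intermediate, other))

REFRESH_EVERY = 10
variable = [1.0, 2.0, 3.0]
other = [0.5, 0.5, 0.5]

recomputations = 0
cached = None
results = []
for step in range(25):
    if step % REFRESH_EVERY == 0:
        # Analogous to the first sess.run that evaluates intermediate_tensor.
        cached = expensive_intermediate(variable)
        recomputations += 1
    # Analogous to the second sess.run that is fed the cached value.
    results.append(final_value(cached, other))
```

Over 25 steps the expensive function runs only at steps 0, 10 and 20, while the cheap step runs every time.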

How to fetch gradients with respect to certain occurrences of variables in tensorflow?

Since TensorFlow supports variable reuse, some parts of the computation graph may occur multiple times in both the forward and the backward pass. So my question is: is it possible to update variables with respect to particular occurrences in the computation graph?
For example, in X_A->Y_B->Y_A->Y_B, Y_B occurs twice. How can each occurrence be updated separately? That is, first treat the latter occurrence as a constant and update the former, then do the opposite.
A simpler example: say X_A, Y_B, Y_A are all scalar variables, and let Z = X_A * Y_B * Y_A * Y_B. The gradient of Z w.r.t. each occurrence of Y_B is X_A * Y_B * Y_A, but the gradient of Z w.r.t. Y_B as a whole is 2 * X_A * Y_B * Y_A. In this example computing the gradients separately may seem unnecessary, but such computations are not always commutative.
In the first example, gradients w.r.t. the latter occurrence can be computed by calling tf.stop_gradient on X_A->Y_B. But I could not think of a way to fetch the gradients for the former one. Is there a way to do it in TensorFlow's Python API?
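The scalar example can be checked numerically by treating the two occurrences of Y_B as separate inputs. This is a plain-Python sketch using finite differences, with arbitrarily chosen values; it only illustrates what a "per-occurrence" gradient means, not how to obtain it from a TensorFlow graph.

```python
def z(x_a, y_b_first, y_a, y_b_second):
    # Same product as in the question, with each occurrence of Y_B as its own argument.
    return x_a * y_b_first * y_a * y_b_second

def partial(f, args, index, h=1e-6):
    """Central finite difference with respect to one argument."""
    up, down = list(args), list(args)
    up[index] += h
    down[index] -= h
    return (f(*up) - f(*down)) / (2.0 * h)

x_a, y_b, y_a = 2.0, 3.0, 5.0       # arbitrary example values
args = (x_a, y_b, y_a, y_b)

grad_first = partial(z, args, 1)    # latter occurrence held constant
grad_second = partial(z, args, 3)   # former occurrence held constant
grad_total = grad_first + grad_second
```

Each per-occurrence gradient equals X_A * Y_B * Y_A, and their sum is the usual gradient 2 * X_A * Y_B * Y_A.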
Edit:
@Seven provided an example of how to deal with this when a single variable is reused. However, often it is a variable scope that is reused, which contains many variables and the functions that manage them. As far as I know, there is no way to reuse a variable scope while applying tf.stop_gradient to all the variables it contains.
As I understand it, when you use A = tf.stop_gradient(A), A is treated as a constant. Here is an example that may help.
import tensorflow as tf
wa = tf.get_variable('a', shape=(), dtype=tf.float32,
                     initializer=tf.constant_initializer(1.5))
b = tf.get_variable('b', shape=(), dtype=tf.float32,
                    initializer=tf.constant_initializer(7))
x = tf.placeholder(tf.float32, shape=())
# The first factor is treated as a constant, so only the second wa*x contributes a gradient.
l = tf.stop_gradient(wa*x) * (wa*x+b)
op_gradient = tf.gradients(l, x)
sess = tf.Session()
sess.run(tf.global_variables_initializer())
print(sess.run([op_gradient], feed_dict={x:11}))
I have a workaround for this question. Define a custom getter for the variable scope concerned, which wraps the default getter with tf.stop_gradient. This makes every variable returned in this scope a Tensor that contributes no gradients, though things sometimes get complicated because it returns a Tensor instead of a variable, such as when using tf.nn.batch_norm.

What is the best way to implement weight constraints in TensorFlow?

Suppose we have weights
x = tf.Variable(np.random.random((5,10)))
cost = ...
And we use the GD optimizer:
upds = tf.train.GradientDescentOptimizer(lr).minimize(cost)
session.run(upds)
How can we implement for example non-negativity on weights?
I tried clipping them:
upds = tf.train.GradientDescentOptimizer(lr).minimize(cost)
session.run(upds)
session.run(tf.assign(x, tf.clip_by_value(x, 0, np.infty)))
But this slows down my training by a factor of 50.
Does anybody know a good way to implement such constraints on the weights in TensorFlow?
P.S.: in the equivalent Theano algorithm, I had
T.clip(x, 0, np.infty)
and it ran smoothly.
You can take the Lagrangian approach and simply add a penalty for features of the variable you don't want.
e.g. To encourage theta to be non-negative, you could add the following to the optimizer's objective function.
added_loss = -tf.minimum( tf.reduce_min(theta),0)
If any element of theta is negative, then added_loss will be positive; otherwise it is zero. Scaling that to a meaningful value is left as an exercise for the reader. Scaling too little will not exert enough pressure; too much may make things unstable.
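As a quick illustration of what this penalty evaluates to, here is the same expression in plain Python on two made-up weight lists (nonneg_penalty is a hypothetical helper name). One thing the sketch makes visible is that, as written, only the single most negative element contributes to the penalty.

```python
def nonneg_penalty(theta):
    """Plain-Python equivalent of -min(min(theta), 0) for a flat list of weights."""
    return max(0.0, -min(theta))

all_nonneg = [0.3, 1.2, 0.0]
some_negative = [0.3, -0.5, -2.0]

penalty_ok = nonneg_penalty(all_nonneg)       # no negative entries
penalty_bad = nonneg_penalty(some_negative)   # driven by the entry -2.0 only
```

If every negative entry should be pushed back, summing the negative parts instead of taking the minimum would be one alternative design.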
As of TensorFlow 1.4, there is a new argument to tf.get_variable that lets you pass a constraint function, which is applied after the optimizer's update. Here is an example that enforces a non-negativity constraint:
with tf.variable_scope("MyScope"):
    v1 = tf.get_variable("v1", …, constraint=lambda x: tf.clip_by_value(x, 0, np.infty))
constraint: An optional projection function to be applied to the variable after being updated by an Optimizer (e.g. used to implement norm constraints or value constraints for layer weights). The function must take as input the unprojected Tensor representing the value of the variable and return the Tensor for the projected value (which must have the same shape). Constraints are not safe to use when doing asynchronous distributed training.
By running
sess.run(tf.assign(x, tf.clip_by_value(x, 0, np.infty)))
you add new nodes to the graph on every call, making it slower and slower.
Instead, you can define a clip_op once when building the graph and run it each time after updating the weights:
# build the graph
x = tf.Variable(np.random.random((5,10)))
loss = ...
train_op = tf.train.GradientDescentOptimizer(lr).minimize(loss)
clip_op = tf.assign(x, tf.clip_by_value(x, 0, np.infty))
# train
sess.run(train_op)
sess.run(clip_op)
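The update-then-clip pattern is a form of projected gradient descent. Here is a minimal plain-Python sketch on a made-up two-variable quadratic whose unconstrained minimum has a negative coordinate; the objective, learning rate and step count are assumptions chosen for illustration.

```python
def grad(x):
    # Gradient of the toy objective (x0 - 1)^2 + (x1 + 2)^2,
    # whose unconstrained minimum is at (1, -2).
    return [2.0 * (x[0] - 1.0), 2.0 * (x[1] + 2.0)]

def project_nonneg(x):
    # Projection onto the non-negative orthant, the role played by clip_op.
    return [max(0.0, v) for v in x]

x = [0.5, 0.5]
lr = 0.1
for _ in range(200):
    g = grad(x)
    x = [v - lr * gv for v, gv in zip(x, g)]  # plain gradient step (train_op)
    x = project_nonneg(x)                     # enforce the constraint (clip_op)
```

The iterate ends up near (1, 0): the first coordinate reaches its unconstrained optimum, while the second is held at the boundary by the projection.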
I recently had this problem as well. I discovered that you can import keras, which has nice weight constraint functions, and use them directly as the kernel constraint in TensorFlow. Here is an example from my code. You can do similar things with the kernel regularizer.
from keras.constraints import non_neg
conv1 = tf.layers.conv2d(
    inputs=features['x'],
    filters=32,
    kernel_size=[5,5],
    strides = 2,
    padding='valid',
    activation=tf.nn.relu,
    kernel_regularizer=None,
    kernel_constraint=non_neg(),
    use_bias=False)
There is a practical solution: Your cost function can be written by you, to put high cost onto negative weights. I did this in a matrix factorization model in TensorFlow with python, and it worked well enough. Right? I mean it's obvious. But nobody else mentioned it so here you go. EDIT: I just saw that Mark Borderding also gave another loss and cost-based solution implementation before I did.
And if "the best way" is wanted, as the OP asked, what then? Well "best" might actually be application-specific, in which case you'd need to try a few different ways with your dataset and consider your application requirements.
Here is working code for increasing the cost for unwanted negative solution variables:
cost = tf.reduce_sum(keep_loss) + Lambda * reg  # Cost = sum of losses for training set, except missing data.
if prefer_nonneg:  # Optionally increase cost for negative values in rhat, if you want that.
    negs_indices = tf.where(rhat < tf.constant(0.0))
    neg_vals = tf.gather_nd(rhat, negs_indices)
    cost += 2. * tf.reduce_sum(tf.abs(neg_vals))  # 2 is a magic number (empirical parameter)
You are free to use my code but please give me some credit if you choose to use it. Give a link to this answer on stackoverflow.com please.
This design would be considered a soft constraint, because you can still get negative weights, if you let it, depending on your cost definition.
It seems that constraint= is also available in TF v1.4+ as a parameter to tf.get_variable(), where you can pass a function like tf.clip_by_value. This seems like another soft constraint, not a hard constraint, in my opinion, because it depends on your function to work well or not. It also might be slow: the original poster tried the same clipping function and reported that training became much slower, although they didn't use the constraint= parameter to do it. I don't see any reason why one would be any faster than the other, since they both use the same clipping approach. So if you use the constraint= parameter, you should expect similarly slow training in the context of the original poster's application.
It would be nicer if TF also provided true hard constraints in the API, and let TF figure out how to both implement them and make them efficient on the back end. I mean, I have seen this done in linear programming solvers for a long time already. The application declares a constraint, and the back end makes it happen.