Keras - coding a custom optimizer and attempting to compute a second gradient inside get_updates - tensorflow

I am a researcher in optimization and I am trying to write a custom optimizer. I have run into a problem, have asked in many places, and so far have had no response.
Take any optimizer code, say just copy SGD. In the beginning of get_updates, you see
grads = self.get_gradients(loss, params)
now add the following line right after this one:
gradsb = self.get_gradients(loss, [tf.Variable(a) for a in params])
this should compute the gradients at a new tensor, with all the values the same as before
now try to see what you get:
for a in gradsb:
    print(a)
you get a list of Nones (but if you print the list grads you see that they are still Tensors)
Why?
And how to circumvent this problem? This is important as I'd like to compute the gradients at another point for my algorithm.

When you write gradsb = self.get_gradients(loss, [tf.Variable(a) for a in params]) you are defining a new tf.Variable for each a in params. Because the loss does not depend on these new variables, your gradients are None.
If you want to compute a second gradient you need to make sure that you're computing it with respect to Tensors that the objective does depend on.
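To make the "loss does not depend on these new variables" point concrete, here is a minimal, framework-free sketch. The Node class and the gradient function are hypothetical stand-ins for tensors and get_gradients, not Keras or TensorFlow APIs; they only mimic the idea that a gradient is found by walking the graph from the loss back to the variable.

```python
class Node:
    """A scalar node in a tiny computation graph (hypothetical stand-in for a tensor)."""
    def __init__(self, value, parents=()):
        self.value = value
        # Each entry is (parent_node, local_derivative_of_this_node_wrt_parent).
        self.parents = parents

def mul(a, b):
    return Node(a.value * b.value, ((a, b.value), (b, a.value)))

def add(a, b):
    return Node(a.value + b.value, ((a, 1.0), (b, 1.0)))

def gradient(output, wrt):
    """Return d(output)/d(wrt), or None if output does not depend on wrt."""
    if output is wrt:
        return 1.0
    total = None
    for parent, local in output.parents:
        upstream = gradient(parent, wrt)
        if upstream is not None:
            total = (total or 0.0) + local * upstream
    return total

w = Node(3.0)
loss = add(mul(w, w), w)      # loss = w*w + w, built from w itself
w_copy = Node(w.value)        # same value, but a brand-new node the loss never used

grad_w = gradient(loss, w)          # 2*w + 1 = 7.0
grad_copy = gradient(loss, w_copy)  # None: no path from loss back to w_copy
print(grad_w, grad_copy)
```

The copy holds the same number, but there is no path from the loss to it, which is the same situation as wrapping each parameter in a fresh tf.Variable.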

Apparently even replacing the current vector of parameters is not OK!! If I type this in the code:
grads = self.get_gradients(loss, params)
tempparam = [tf.Variable(a) for a in params]
params = [tf.add(a,a) for a in params]
gradsn = self.get_gradients(loss, params)
for a in gradsn:
    print(a)
params = [tf.Variable(a) for a in tempparam]
The result is still that None is printed!!
I think it is clear what I am trying to do: at each iteration of get_updates, I would like to compute the gradients at a (slightly) different value of the parameter tensors, and use that to construct the update to the parameters for optimization and training. Is there any way to do this within the keras package?
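For reference, the kind of update described here can be sketched without any framework. This is only an illustration under assumptions: the quadratic loss, the grad function and the step size are made up, and the two-gradient step shown is an extragradient-style step, not necessarily the asker's algorithm. The point is that the second gradient comes from re-evaluating the loss as a function of the shifted point, rather than from differentiating the old loss with respect to new, unrelated tensors.

```python
def loss(w):
    # Made-up quadratic loss with its minimum at w = [1, 1].
    return sum((wi - 1.0) ** 2 for wi in w)

def grad(w):
    # Analytic gradient of the loss above, evaluated at whatever point is passed in.
    return [2.0 * (wi - 1.0) for wi in w]

def shifted_gradient_step(w, lr=0.1):
    g = grad(w)                                        # gradient at the current point
    w_shift = [wi - lr * gi for wi, gi in zip(w, g)]   # slightly different point
    g_shift = grad(w_shift)                            # gradient re-evaluated there
    # Update the ORIGINAL parameters using the gradient from the shifted point.
    return [wi - lr * gi for wi, gi in zip(w, g_shift)]

w0 = [3.0, -1.0]
w1 = shifted_gradient_step(w0)
print(loss(w0), loss(w1))
```

In a graph-based setting the analogue would presumably be to rebuild the loss from the shifted tensors, or to assign the shifted values to the existing variables before evaluating the gradient again.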

Related

Loss reduction in tf.distributed.MirroredStrategy()

I'm confused regarding using distributed strategy and the correct way of reduction in loss functions.
I implemented a U-Net using tf.distribute.MirroredStrategy(). Everything works fine using default loss BinaryCrossentropy as follows:
with strategy.scope():
    model = build_network((size, size, 3), num_classes)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=args.learning_rate),
                  loss=tf.keras.losses.BinaryCrossentropy())
However, I want to create custom loss functions. To start with, I wrote a wrapper containing BinaryCrossentropy, to get familiar with the correct way of using the reduction methods. I followed the instructions in https://www.tensorflow.org/tutorials/distribute/custom_training#define_the_loss_function
and used tf.nn.compute_average_loss in order to divide by the global batch_size.
def loss_functions(loss_spec):
    if loss_spec == 'cross_entropy':
        def c_loss(truth, pred):
            my_loss = tf.keras.losses.BinaryCrossentropy(reduction=tf.keras.losses.Reduction.NONE)(truth, pred)
            my_loss = tf.math.reduce_mean(my_loss, axis=[1, 2])  # to compute average across the two image dimensions
            my_loss = tf.nn.compute_average_loss(my_loss)  # sums up all items and divides by batch size
            return my_loss
        return c_loss
which is called in the following way:
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=args.learning_rate),
              loss=utils.loss_functions('cross_entropy'))
It also works, but I noticed a difference of a factor equal to the number of replicas compared to using tf.keras.losses.BinaryCrossentropy(). I.e., when using two replicas, using BinaryCrossentropy() directly yields a loss twice as large as my custom loss. Thus, to get the same value, I would need to divide by the batch size per replica instead of the global batch size, i.e., the way it should NOT be done according to the documentation.
However, the documentation refers to building an own training routine, whereas I am using model.compile() and model.fit() methods.
Can anybody explain this behaviour to me?
UPDATE:
The use of tf.nn.compute_average_loss, or of any reduction on the batch axis, is not needed at all when using model.compile() and model.fit(): the reduction and scaling are done automatically. However, I still do not know how model.fit() works internally.
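The arithmetic behind a factor equal to the number of replicas can be sketched in plain Python. The loss values below are made up, and this is only one way such a factor can arise; it is not a description of what model.fit() does internally, which the source leaves open.

```python
per_example_losses = [0.2, 0.4, 0.6, 0.8]   # made-up losses for a global batch of 4
num_replicas = 2
global_batch_size = len(per_example_losses)

# Each replica sees half of the global batch.
per_replica = [per_example_losses[:2], per_example_losses[2:]]

# compute_average_loss-style scaling: sum on each replica, divide by the GLOBAL batch size.
scaled = [sum(chunk) / global_batch_size for chunk in per_replica]

global_mean = sum(per_example_losses) / global_batch_size

# Summing the scaled per-replica values recovers the global mean ...
sum_over_replicas = sum(scaled)
# ... but averaging them (a second division) is too small by a factor of num_replicas.
mean_over_replicas = sum(scaled) / num_replicas

print(global_mean, sum_over_replicas, mean_over_replicas)
```

So if the loss is already divided by the global batch size and some later step divides or averages again, the reported value shrinks by the number of replicas, which matches the factor of two seen with two replicas.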
Thanks and cheers, everybody

Using tf.cond() to feed my graph for training and validation

In my TensorFlow code I want my network to take inputs from one of the two StagingArea objects depending upon whether I want to do training or testing.
A part of the graph construction code I wrote is as follows :
with tf.device("/gpu:0"):
    for i in range(numgpus):
        with tf.variable_scope(tf.get_variable_scope(), reuse=i>0) as vscope:
            with tf.device('/gpu:{}'.format(i)):
                with tf.name_scope('GPU-Tower-{}'.format(i)) as scope:
                    phase = tf.get_variable("phase", [], initializer=tf.zeros_initializer(),dtype=tf.uint8, trainable=False)
                    phaseassigntest = phase.assign(1)
                    phaseassigntrain = phase.assign(0)
                    phasetest = tf.equal(phase, 0)
                    is_training = tf.cond(phasetest, lambda: tf.constant(True), lambda: tf.constant(False))
                    trainstagingarea = tf.contrib.staging.StagingArea([tf.float32, tf.int32], shapes=[[trainbatchsize, 3, 221, 221], [trainbatchsize]], capacity=20)
                    putoptrain = trainstagingarea.put(train_iterator.get_next())
                    trainputop.append(putoptrain)
                    getoptrain = trainstagingarea.get()
                    traingetop.append(getoptrain)
                    trainclearop = trainstagingarea.clear()
                    trainstageclear.append(trainclearop)
                    trainsizeop = trainstagingarea.size()
                    trainstagesize.append(trainsizeop)
                    valstagingarea = tf.contrib.staging.StagingArea([tf.float32, tf.int32], shapes=[[valbatchsize, 3, 221, 221], [valbatchsize]], capacity=20)
                    putopval = valstagingarea.put(val_iterator.get_next())
                    valputop.append(putopval)
                    getopval = valstagingarea.get()
                    valgetop.append(getopval)
                    valclearop = valstagingarea.clear()
                    valstageclear.append(valclearop)
                    valsizeop = valstagingarea.size()
                    valstagesize.append(valsizeop)
                    #elem = valgetop[i]
                    elem = tf.cond(is_training,lambda: traingetop[i],lambda: valgetop[i])
                    img = elem[0]
                    label = elem[1]
                    labelonehot = tf.one_hot(label, depth=numclasses)
                    net, networksummaries = overfeataccurate(img,numclasses=numclasses, phase=is_training)
I have used tf.cond to make sure that the network is fed by one of the two StagingArea objects. One is meant for training and the other one is meant for validation.
Now, when I try to execute the graph as follows, I do not get any result; in fact the code just hangs and I have to kill the process.
with tf.Session(graph=g,config=config) as sess:
    sess.run(init_op)
    sess.run(tf.local_variables_initializer())
    sess.run(val_initialize)
    for i in range(20):
        sess.run(valputop)
        print(sess.run(valstagesize))
    writer = tf.summary.FileWriter('.', graph=tf.get_default_graph())
    epoch = 0
    iter = 0
    print("Performing Validation")
    sess.run(phaseassigntest)
    saver = tf.train.Saver()
    while(epoch<10):
        time_init = time.time()
        while True:
            try:
                [val_accu, _, summaries] = sess.run([towervalidation, towervalidationupdateop,validation_summary_op])
                print(val_accu)
When, instead of using tf.cond(), I directly assign elem = valgetop[i], the code works just fine.
Am I missing something over here ?
What is the right way to feed my network based on whether I want to do training or testing ?
NOTE The error does not go away even if I set numgpus to 1.
Your problem
What you think tf.cond does
Based on the flag, execute what is required to put either traingetop[i] or valgetop[i] into your elem tensor.
What tf.cond actually does
Executes what is required to get both traingetop[i] and valgetop[i], then passes one of them into your elem tensor.
So
The reason it is hanging forever is because it's waiting for an element to be added to your training staging area (so that it can get that element and discard it). You're forgiven for not realising this is what it's doing; it's actually very counter-intuitive. The documentation is awfully unclear on how to deal with this.
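A plain-Python analogy may help picture this. It is only an analogy under assumptions: the two queue.Queue objects stand in for the staging areas, select is a hypothetical helper, and get_nowait is used so the script reports the problem instead of hanging the way a blocking get would.

```python
import queue

train_q = queue.Queue()        # empty: nothing has been staged for training
val_q = queue.Queue()
val_q.put("val-batch")

def select(flag, train_value, val_value):
    """Both arguments have already been computed by the time this runs."""
    return train_value if flag else val_value

def eager_choice(flag):
    # Both gets are evaluated before select() is even called, so the empty
    # training queue is touched although flag asks for validation data.
    try:
        return select(flag, train_q.get_nowait(), val_q.get_nowait())
    except queue.Empty:
        return "would block"

def lazy_choice(flag):
    # Only the queue that is actually needed is touched.
    return train_q.get_nowait() if flag else val_q.get_nowait()

eager_result = eager_choice(False)
lazy_result = lazy_choice(False)
print(eager_result, lazy_result)
```

The eager version corresponds to the behaviour described above: the training get is a prerequisite of the selection, so an empty training staging area stalls everything even when only validation data is wanted.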
Recommended Solution (by Tensorflow documentation)
If you really need the queues to be in the same graph, then you need to make two copies of your ENTIRE graph, one that is fed by your training staging area, and one that is fed by your validation staging area. Then you just use the relevant tensor in your sess.run call. I recommend creating a function that takes a queue output tensor, and returns a model_output tensor. Now you have a train_time_output tensor and a validation_time_output tensor, and you can choose which one you want to execute in your sess.run.
Warning
You need to make sure that you aren't actually creating new variables to go along with these new ops. To do that take a look at the latest documentation on variables. It looks like they've simplified it from v0.12, and it essentially boils down to using tf.get_variable instead of tf.Variable to create your variables.
My preferred work around
Although that is the recommended solution (AFAIK), it is extremely unsatisfying to me; you're creating a whole other set of operations on the graph that just happen to use the same weights. It seems like there's a lot of potential for programmer error by abusing the separation between train time and test/validation time (resulting in the model acting unexpectedly different at these times). Worse; it doesn't solve the problem of tf.cond demanding the values for inputs to both branches, it just forces you to copy your whole graph, which is not always possible.
I prefer to just not have my queues in the graph like that, and treat the model as a function which can be fed an example without caring where it's from. That is, I would instantiate the model with a tf.placeholder as the input, and at execution time I would use feed_dict to actually provide the value. It would function something like this
# inside main training loop
if time_to_train:
    example = sess.run(traingetop)
else:
    example = sess.run(valgetop)
result = sess.run(model_output, {input_placeholder: example})
It's very useful to note that you can use the feed_dict to feed any value for any tensor anywhere in your model. So, you can change any model definition that, due to tf.cond would always require an input, like:
a = tf.constant(some_value)
b = tf.placeholder(tf.float32)
flag = tf.placeholder(tf.bool, [])
one_of_them = tf.cond(flag, lambda: a, lambda: b)  # tf.cond expects callables for both branches
model_output = build_graph(one_of_them)
Into a definition that doesn't, like:
a = tf.constant(some_value)
model_output = build_graph(a)
Remembering that you can always overwrite what a is at execution time:
# In main training loop,
sess.run(train_op, {a: some_other_value})
This essentially pushes the conditional into native python land. In your code you might end up with something like:
if condition_satisfied:
    sess.run(train_op, {a: some_other_value})
else:
    sess.run(train_op)
Performance concerns
If you are using tensorflow on a single machine, the performance cost of this solution is usually small. Be aware, though, that sess.run returns NumPy arrays in host memory, so each example is copied off the device and then fed back in through feed_dict.
If you are using tensorflow in a distributed fashion, then this solution would kill your performance; it would require sending the example from whatever machine it's on to the master so that it can send it back.

Tensorflow update only selected variables

Overview: I want to update only selected variables in a network. The network has parts A->B (in forward direction) and each of them has separate losses La and Lb. I want to train the weights a of A to optimize Lb. While doing this, the weights b of B should be fixed. How can I do this?
Approach 1: Select only a as variables to minimize using var_list in optimizer.minimize(loss, var_list=[a]) (see https://github.com/tensorflow/tensorflow/issues/834). This crashes with an error ValueError: No gradients provided for any variable, check your graph for ops that do not support gradients, between variables (...) and loss (...). This actually works fine in other scenarios, but apparently it does not like that weights b are not in the var_list.
Edit 1: The line that causes the error: a_optim = tf.train.AdamOptimizer(args.lr, beta1=args.beta1).minimize(self.a_loss, var_list=self.a_vars, global_step=self.global_step)
Approach 2: Same as Approach 1, but also include b in the var_list. The problem is now that the network updates a and b, whereas it should just send the gradients through B and only update A.
Edit 2: The line that works, but is not what I want: a_optim = tf.train.AdamOptimizer(args.lr, beta1=args.beta1).minimize(self.a_loss, var_list=self.a_vars+self.b_vars, global_step=self.global_step)
Approach 3: Use tf.stop_gradient(tensor), as in the question "Holding variables constant during optimizer". From the documentation I infer that this only stops the gradients from flowing further to the left in the graph. I want to ignore the variables on the right.
Approach 4: Set tf.Variable(..., trainable=False), but that looks very inflexible if I want to alternate training between A and B.
I found that, for better control over which variables are updated during optimization, it is better to use the 'compute_gradients' and 'apply_gradients' approach.
compute_gradients returns a list of (gradient, variable) tuples. You can modify the returned gradient tensors however you want, and you can also select the subset of variables to update.
Then you pass the list of (gradient, variable) tuples that you want to update to 'apply_gradients'.
Here are some examples:
optimizer = tf.train.AdamOptimizer(learning_rate=0.0001)
grads = optimizer.compute_gradients(your_cost_function)
# You can update 'g' and exclude some v's
grad_lists = [(g, v) for g, v in grads]
train_op = optimizer.apply_gradients(grad_lists)
Then, run your session.
sess.run(train_op, feed_dict={...})
Also, since you have 2 loss functions, you should create 2 train operations.
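The selection step can be pictured with a small framework-free sketch. Everything here is hypothetical: the parameter names, the made-up gradient values and the apply_gradients helper are stand-ins for illustration, not the tf.train.Optimizer API. It only shows the pattern of filtering (gradient, variable) pairs before applying them.

```python
# Hypothetical parameters of sub-networks A and B, keyed by name.
params = {"a/w": 1.0, "a/b": 0.5, "b/w": -2.0}

# Made-up (gradient, variable_name) pairs, as a compute_gradients-like step might return.
grads_and_vars = [(0.1, "a/w"), (-0.2, "a/b"), (0.3, "b/w")]

def apply_gradients(params, grads_and_vars, lr=0.5):
    """Plain SGD update for exactly the variables listed in grads_and_vars."""
    updated = dict(params)
    for g, name in grads_and_vars:
        updated[name] = params[name] - lr * g
    return updated

# Keep only the pairs that belong to A; B's weights are simply never passed on.
only_a = [(g, name) for g, name in grads_and_vars if name.startswith("a/")]
new_params = apply_gradients(params, only_a)
print(new_params)
```

The gradients for A can still be computed through B; B's variables just stay untouched because their pairs are dropped before the update.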
Hope this helps!
It turns out that the final op in A was non-differentiable (tf.argmax), and therefore gradients could not be passed from B to A.
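A quick way to see why an argmax blocks gradients is that its output is piecewise constant: small changes to the input do not change the winning index. The sketch below uses a hypothetical finite-difference helper in plain Python and is not a statement about how TensorFlow handles the op internally.

```python
def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda i: values[i])

def finite_difference(f, x, index, eps=1e-6):
    """Approximate the partial derivative of f with respect to x[index]."""
    bumped = list(x)
    bumped[index] += eps
    return (f(bumped) - f(x)) / eps

x = [0.1, 0.7, 0.2]
slopes = [finite_difference(argmax, x, i) for i in range(len(x))]
print(slopes)  # every slope is 0.0, so there is no signal to pass back to A
```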

How do you create a dynamic_rnn with dynamic "zero_state" (Fails with Inference)

I have been working with the "dynamic_rnn" to create a model.
The model is based upon a signal with 80 time periods, and I want to zero the "initial_state" before each run, so I have set up the following code fragment to accomplish this:
state = cell_L1.zero_state(self.BatchSize,Xinputs.dtype)
outputs, outState = rnn.dynamic_rnn(cell_L1,Xinputs,initial_state=state, dtype=tf.float32)
This works great for the training process. The problem is once I go to the inference, where my BatchSize = 1, I get an error back as the rnn "state" doesn't match the new Xinputs shape. So what I figured is I need to make "self.BatchSize" based upon the input batch size rather than hard code it. I tried many different approaches, and none of them have worked. I would rather not pass a bunch of zeros through the feed_dict as it is a constant based upon the batch size.
Here are some of my attempts. They all generally fail since the input size is unknown upon building the graph:
state = cell_L1.zero_state(Xinputs.get_shape()[0],Xinputs.dtype)
.....
state = tf.zeros([Xinputs.get_shape()[0], self.state_size], Xinputs.dtype, name="RnnInitializer")
Another approach, thinking the initializer might not get called until run-time, but still failed at graph build:
init = lambda shape, dtype: np.zeros(*shape)
state = tf.get_variable("state", shape=[Xinputs.get_shape()[0], self.state_size],initializer=init)
Is there a way to get this constant initial state to be created dynamically, or do I need to reset it through the feed_dict with tensor-serving code? Is there a clever way to do this only once within the graph, maybe with a tf.Variable.assign?
The solution came down to obtaining the "batch_size" in a way that is not hard coded.
This was the approach from the given example, which does not work:
Xinputs = tf.placeholder(tf.int32, (None, self.sequence_size, self.num_params), name="input")
state = cell_L1.zero_state(Xinputs.get_shape()[0],Xinputs.dtype)
The problem is the use of "get_shape()[0]". This returns the static shape of the tensor and takes the batch size entry at [0]. The documentation doesn't seem to be that clear, but the static shape is whatever is known when the graph is built, so it cannot follow the batch size that arrives at run time.
Using the "tf.shape()" function seems to do the trick. It returns a tensor rather than a static shape, so its value is computed at run time from the actual input. Using this code fragment solved the problem of training with a batch of 128 and then loading the graph into TensorFlow Serving for inference on a batch of just 1.
Xinputs = tf.placeholder(tf.int32, (None, self.sequence_size, self.num_params), name="input")
batch_size = tf.shape(Xinputs)[0]
state = self.cell_L1.zero_state(batch_size,Xinputs.dtype)
Here is a good link to TensorFlow FAQ which describes this approach 'How do I build a graph that works with variable batch sizes?':
https://www.tensorflow.org/resources/faq
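The difference between a batch size fixed at build time and one read at run time can be mimicked in plain Python. This is only an analogy with hypothetical helper names (build_fixed, build_dynamic): the closure plays the role of the built graph, with the fixed version corresponding to a hard-coded self.BatchSize and the dynamic version corresponding to tf.shape(Xinputs)[0].

```python
def build_fixed(batch_size):
    # The zero state is created once, with the size known at build time.
    zero_state = [0.0] * batch_size

    def run(inputs):
        if len(inputs) != len(zero_state):
            raise ValueError("state has batch %d, input has batch %d"
                             % (len(zero_state), len(inputs)))
        return [s + x for s, x in zip(zero_state, inputs)]
    return run

def build_dynamic():
    def run(inputs):
        # The zero state is sized from the input that actually arrives.
        zero_state = [0.0] * len(inputs)
        return [s + x for s, x in zip(zero_state, inputs)]
    return run

fixed_model = build_fixed(batch_size=128)   # "trained" with batches of 128
dynamic_model = build_dynamic()

single_example = [0.5]                      # inference with a batch of 1
dynamic_output = dynamic_model(single_example)
try:
    fixed_model(single_example)
    fixed_failed = False
except ValueError:
    fixed_failed = True
print(dynamic_output, fixed_failed)
```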

Does TensorFlow automatically use sparse_softmax_cross_entropy_with_logits when possible?

Let's say that I have some code such as:
out = tf.nn.softmax(x) # shape (batch,time,n)
labels = .... # reference labels of type (batch,time)->int
And then I define my loss as the Cross Entropy:
loss = -tf.log(tf.gather_nd(out, labels))
Will TensorFlow automatically replace the loss in the computation graph by this?
loss = sparse_softmax_cross_entropy_with_logits(x, labels)
What type of optimizations can I expect that TensorFlow will apply?
Follow-up question: If TensorFlow doesn't do this optimization, how can I do it manually? Consider that I have a modular framework where I get some out tensor which could possibly be the output of a softmax operation, and I want to calculate Cross Entropy, and I want to use sparse_softmax_cross_entropy_with_logits if possible. How could I accomplish this? Can I do something like the following?
if out.op == "softmax": # how to check this?
    x = out.op.sources[0] # how to get this?
    loss = sparse_softmax_cross_entropy_with_logits(x, labels)
else:
    loss = -tf.log(tf.gather_nd(out, labels))
TensorFlow generally doesn't merge nodes together in the way you're hoping. This is because other code (e.g. fetching outputs when running) may depend on intermediate nodes like the softmax, so removing them behind the user's back would be confusing.
If you do want to do this optimization yourself as part of a higher-level framework, you can analyze the current graphdef, but there's no annotation in TF to tell you what the outputs are, since that can vary at runtime depending on how session.run is called.
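Since the rewrite is not done for you, it may help to see why the fused form is worth the effort. The sketch below is plain Python with hypothetical helper names, not TensorFlow code; it compares taking the log of a softmax output with computing the cross entropy directly from the logits via log-sum-exp, which is the kind of formulation a fused op is generally expected to use.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def naive_cross_entropy(logits, label):
    # -log(softmax(x)[label]): the probability can underflow to exactly 0.
    p = softmax(logits)[label]
    return -math.log(p) if p > 0.0 else math.inf

def fused_cross_entropy(logits, label):
    # log-sum-exp form computed straight from the logits.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[label]

logits = [1000.0, 0.0]       # made-up, deliberately extreme logits
naive = naive_cross_entropy(logits, 1)
fused = fused_cross_entropy(logits, 1)
print(naive, fused)
```

With these extreme logits the naive version loses the value entirely (the probability underflows to zero), while the logit-based version returns a finite loss.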