Why is gradient clipping not supported with a distribution strategy in Tensorflow? - tensorflow

It looks like gradient clipping is not supported using a distribution strategy
https://github.com/tensorflow/tensorflow/blob/f9f6b4cec2a1bdc5781e4896d80cee1336a2fbab/tensorflow/python/keras/optimizer_v2/optimizer_v2.py#L383
("Gradient clipping in the optimizer "
"(by setting clipnorm or clipvalue) is currently "
"unsupported when using a distribution strategy.")
Is there a reason for this? I am tempted to define a custom def _minimize(strategy, tape, optimizer, loss, trainable_variables): that clips the gradients directly.

GitHub user tomerk wrote:
There are two possible places to clip when you have distribution strategies enabled:
- before gradients get aggregated (usually wrong)
- after gradients get aggregated (usually right and what people expect)
We want it working with the second case (clipping after gradients are aggregated). The issue is that the optimizers are written with clipping happening in the code before aggregation does.
We looked into changing this, but it would have required either:
- API changes that break existing users of the optimizer's apply_gradients and other non-minimize methods, or
- changing the signatures of methods that optimizer implementers need to implement, breaking existing custom optimizers.
So rather than:
- quietly doing clipping in the wrong place, or
- increasing churn and breaking existing users or existing custom optimizers just for this individual feature,
we instead decided to leave this disabled for now. We'll roll support for this into a larger optimizer refactoring that solves a larger set of issues.
This has now been implemented.
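For reference, here is a minimal sketch (not the library's own implementation) of what clipping after aggregation can look like in a custom training step. It assumes a reasonably recent TF 2.x, where strategy.run and the experimental_aggregate_gradients argument of apply_gradients exist; the model, optimizer, loss, and clip norm are placeholders.

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
    optimizer = tf.keras.optimizers.SGD(0.1)

@tf.function
def train_step(x, y):
    def step_fn(x, y):
        with tf.GradientTape() as tape:
            # Scale the per-replica loss so the summed gradients correspond
            # to the global batch.
            loss = tf.reduce_mean(
                tf.square(y - model(x, training=True))) / strategy.num_replicas_in_sync
        grads = tape.gradient(loss, model.trainable_variables)
        # Aggregate across replicas first ...
        grads = tf.distribute.get_replica_context().all_reduce(
            tf.distribute.ReduceOp.SUM, grads)
        # ... then clip the aggregated gradients (the "usually right" place).
        grads, _ = tf.clip_by_global_norm(grads, 1.0)
        # Tell the optimizer not to aggregate a second time.
        optimizer.apply_gradients(
            zip(grads, model.trainable_variables),
            experimental_aggregate_gradients=False)
        return loss
    return strategy.run(step_fn, args=(x, y))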

Related

FTRL optimizer in tensorflow seems not work well

I tried to train an LR model on a large-scale dataset via TensorFlow with the FTRL optimizer for a CTR task. The TensorFlow/sklearn AUC and the training/evaluation AUC are OK, but performance in production is not good. I've tried lowering the degree of distribution, but that didn't fully resolve the issue. Any suggestions?
I found at least two reasons:
First, the underlying implementation is not exactly the same as in the original paper. I don't know why they did this; an explanation would be welcome.
Second, the gradients used to update the weights are batch gradients, i.e. the parameter-server weights are updated once per batch (perfectly ordinary in a modern distributed system, but not the scenario the original paper assumes), so the training data is not used record-wise. Personally, I think the second point is the key one.

Policy Gradient (REINFORCE) for invalid actions

Currently I'm trying to implement the REINFORCE policy gradient method (with neural network) for a game. Now obviously, there are certain actions that are invalid in certain states (can't fire the rocket launcher if you don't have one!).
I tried to mask the softmax outputs (the action probabilities) so that it only samples from valid actions. This works fine (or so it seems); however, after several iterations of training these actions are no longer being chosen (all outputs for these nodes turn into 0 for certain input combinations). Interestingly, a certain action node (an invalid action) seems to give 1 (100% probability) in these cases.
This is causing a huge problem, since I then have to resort to randomly choosing the action to perform, which obviously doesn't do well. Are there any other ways to deal with the problem?
P.S. I'm updating the network by setting the "label" so that the chosen action node has the value of the discounted reward while the remaining actions are 0, then doing a categorical_crossentropy in Keras.
I ended up using 2 different approaches, but they both follow the methodology of applying invalid action masks.
One is to use the mask after obtaining the softmax values from policy gradient, then normalize the probability of the remaining actions and sample from those.
The second method is applying the mask after the logit layer, which is simpler and seems to have a better effect (although I didn't do any quantitative measurement to prove this).
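For illustration, here is a rough TF 2.x sketch of the second approach (masking at the logit level); the logits and mask values are made up, and the idea is simply to push invalid logits toward negative infinity before the softmax/sampling step.

import tensorflow as tf

# Made-up policy logits for 4 actions and a mask with 1 = valid, 0 = invalid.
logits = tf.constant([[2.0, 0.5, -1.0, 1.2]])
valid_mask = tf.constant([[1.0, 0.0, 1.0, 1.0]])

# Add a large negative number to invalid logits so softmax assigns them
# (numerically) zero probability.
masked_logits = logits + (1.0 - valid_mask) * -1e9
probs = tf.nn.softmax(masked_logits)

# Sampling from the masked logits can only pick valid actions.
action = tf.random.categorical(masked_logits, num_samples=1)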

What is the reason for new function tf.nn.softmax_cross_entropy_with_logits_v2? [duplicate]

I am wondering why in Tensorflow version 1.5.0 and later, softmax_cross_entropy_with_logits_v2 defaults to backpropagating into both labels and logits. What are some applications/scenarios where you would want to backprop into labels?
I saw the github issue below asking the same question, you might want to follow it for future updates.
https://github.com/tensorflow/minigo/issues/37
I don't speak for the developers who made this decision, but I would surmise that they made it the default because backpropagating into labels is indeed used fairly often, and for most applications where you aren't backpropagating into the labels, the labels are a constant anyway and won't be adversely affected.
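As a side note, if you want to make sure nothing flows into the labels, the documented workaround is to wrap them in tf.stop_gradient. A small sketch (using the TF 2.x name, where the old _v2 behaviour is the default); the toy values here are only illustrative:

import tensorflow as tf

logits = tf.Variable([[2.0, 1.0, 0.1]])
labels = tf.Variable([[0.7, 0.2, 0.1]])  # soft labels, made variables only for illustration

with tf.GradientTape() as tape:
    loss = tf.nn.softmax_cross_entropy_with_logits(
        labels=tf.stop_gradient(labels), logits=logits)

grads = tape.gradient(loss, [logits, labels])
# grads[0] is a real tensor; grads[1] is None because nothing flows into the labels.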
Two common use cases for backpropagating into labels are:
Creating adversarial examples
There is a whole field of study around building adversarial examples that fool a neural network. Many of the approaches involve training a network, then holding the network fixed and backpropagating into the labels (the original image) to tweak it (usually under some constraints) to produce a result that fools the network into misclassifying the image (a toy sketch of this is given after this list).
Visualizing the internals of a neural network.
I also recommend watching the deepviz toolkit video on YouTube; you'll learn a ton about the internal representations learned by a neural network.
https://www.youtube.com/watch?v=AgkfIQ4IGaM
If you continue digging into that and find the original paper you'll find that they also backpropagate into the labels to generate images which highly activate certain filters in the network in order to understand them.
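To make the first use case concrete, here is a toy, FGSM-style sketch: the network's weights are held fixed and the gradient is taken with respect to the input instead, nudging it toward a wrong class. The tiny model, shapes, and step size are placeholders, not anything from the original answer.

import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Flatten(input_shape=(28, 28)),
                             tf.keras.layers.Dense(10)])  # outputs logits
image = tf.Variable(tf.random.uniform([1, 28, 28]))
wrong_label = tf.one_hot([3], depth=10)

with tf.GradientTape() as tape:
    loss = tf.nn.softmax_cross_entropy_with_logits(
        labels=wrong_label, logits=model(image))

# Gradient with respect to the input, not the weights: a small signed step
# moves the image toward the wrong class while the network stays fixed.
grad = tape.gradient(loss, image)
adversarial = image - 0.01 * tf.sign(grad)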

skipping layer in backpropagation in keras

I am using Keras with tensorflow backend and I am curious whether it is possible to skip a layer during backpropagation but have it execute in the forward pass. So here is what I mean
Lambda (lambda x: a(x))
I want to apply a to x in the forward pass but I do not want a to be included in the derivation when the backprop takes place.
I was trying to find a solution but I could not find anything. Can somebody help me out here?
UPDATE 2
In addition to tf.py_func, there is now an official guide on how to add a custom op.
UPDATE
See this question for an example of writing a custom op with gradient purely in Python without needing to rebuild anything. Note that there are some limitations to the method (see the documentation of tf.py_func).
Not exactly a solution to the problem, but still kind of an answer and too long for comments.
That's not even a Keras issue, but a TensorFlow one. Each op defines its own gradient computation that is used during backpropagation. If you really wanted to do something like that, you would need to implement the op in TensorFlow yourself (no easy feat) and define the gradient that you want, because you can't have "no gradient"; if anything, it would be 1 or 0 (otherwise you couldn't continue the backpropagation). There is a tf.NoGradient function in TensorFlow which causes an op to propagate zeros, but I don't think it is meant to be (or can be) used outside of TensorFlow's own internals.
UPDATE
Okay, so a bit more context. TensorFlow graphs are built of ops, which are implemented by kernels; this is basically a 1-to-1 mapping, except that there may be, for example, a CPU and a GPU kernel for an op, hence the distinction. The set of ops supported by TensorFlow is usually static (it can change with newer versions), but in principle you cannot add your own ops, because the ops of a graph go into the Protobuf serialized format, so if you made your own ops you would not be able to share your graph. Ops are defined at the C++ level with the macro REGISTER_OP (see for example here), and kernels with REGISTER_KERNEL_BUILDER (see for example here).
Now, where do gradients come into play? Well, the funny thing is that the gradient of an op is not defined at the C++ level; there are ops (and kernels) that implement the gradient of other ops (if you look at the previous files you'll find ops/kernels whose names end in Grad), but (as far as I'm aware) these are not explicitly "linked" at this level. It seems that the associations between ops and their gradients are defined in Python, usually via tf.RegisterGradient or the aforementioned tf.NoGradient (see for example here; Python modules starting with gen_ are autogenerated with the help of the C++ macros); these registrations tell the backpropagation algorithm how to compute the gradient of the graph.
So, how to actually work this out? Well, you need to create at least one op in C++ with the corresponding kernel(s) implementing the computation you want for your forward pass. Then, if the gradient computation that you want to use can be expressed with existing TensorFlow ops (which is most likely), you just need to call tf.RegisterGradient in Python and do the computation there in "standard" TensorFlow. This is quite involved, but the good news is that it's possible, and there's even an example for it (although I think they kind of forgot the gradient registration part in that one)! As you will see, the process involves compiling the new op code into a library (by the way, I'm not sure whether any of this works on Windows) that is then loaded from Python (obviously this involves going through the painful process of manually compiling TensorFlow with Bazel). A possibly more realistic example can be found in TensorFlow Fold, an extension of TensorFlow for structured data, which registers (as of now) one custom operation here through a macro defined here that calls REGISTER_OP, and then in Python loads the library and registers its gradient here through their own registration function defined here, which simply calls tf.NotDifferentiable (another name for tf.NoGradient).
tldr: It is rather hard, but it can be done and there are even a couple of examples out there.
As mentioned in jdehesa's comments, you can implement your function with an "alternative gradient". Forgive me if my math is not correct, but I think a derivative returning 1 would be the correct way to have no effect on the backpropagation while still passing the learning through. For how to construct it, see here. The example I cited goes further and allows you to construct an activation function from a Python function. So in place of the spiky function, substitute your function a, and in place of its derivative d_spiky use:
def constant(x):
    return 1
So on the forward pass, a is applied in the layer, and on the backward pass 1 is applied, which should simply pass the weight adjustments through.
You can then just create an Activation layer in Keras using this function.
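In newer TensorFlow versions the same "identity gradient" idea can be written without any C++ at all, via tf.custom_gradient. A minimal sketch, assuming a is whatever (possibly non-differentiable) function you want applied only in the forward pass; the tf.round stand-in is just an example:

import tensorflow as tf

def a(x):
    return tf.round(x)  # stand-in for the forward-only function

@tf.custom_gradient
def forward_only(x):
    y = a(x)
    def grad(dy):
        # Pretend the derivative is 1: hand the upstream gradient straight
        # through (a straight-through estimator).
        return dy
    return y, grad

layer = tf.keras.layers.Lambda(forward_only)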

Which operations support automatic gradients?

I have a fairly complex quantisation layer in my network which contains, among other operations, tf.tile and tf.expand_dims ops. I noticed that my network did not train well. Looking at some debug output, I saw that the fully connected layer before this quantisation layer got zero gradients for its weights (I used optimizer.compute_gradients to determine this). Does this mean that whatever is before the quantisation layer does not update during training?
In general: how do I figure out which operations let gradients pass through and which do not? For instance, do the above-mentioned tf.tile and tf.expand_dims let gradients pass through?
If there is an operation without gradients in your model you will get an error:
LookupError: No gradient defined for operation [...]
So your problem seems to be somewhere else; maybe you have a multiplication by zero somewhere which kills the gradients. There is not enough information in your question to find the real reason for your problem.
Edit:
I didn't directly answer the question which operations support automatic gradients.
It is not listed in the documentation, and I think you can only find out by checking the source code, or by using the operation and seeing whether you get the mentioned error when you try to optimize the model.
For tf.tile and tf.expand_dims there are gradients defined.
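A quick way to convince yourself of that in eager/TF 2.x code (this snippet is just a sanity check, not from the original question):

import tensorflow as tf

x = tf.Variable([[1.0, 2.0]])

with tf.GradientTape() as tape:
    y = tf.tile(tf.expand_dims(x, axis=0), [3, 1, 1])
    loss = tf.reduce_sum(y)

# Non-None gradient; every entry is 3.0 because each element of x is tiled 3 times.
print(tape.gradient(loss, x))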