How to control reduction strategy for stateful metric in keras mirrored strategy - tensorflow

I use the Keras fit() method with custom metrics passed to the model.
The metrics are stateful, i.e. subclasses of Metric, as described in https://keras.io/api/metrics/#as-subclasses-of-metric-stateful
When I run the code in a multi-GPU environment using tf.distribute.MirroredStrategy(), my metric code is called on every GPU separately, with batch_size/no_of_gpus examples passed, which is reasonable to expect.
What happens next is that the per-GPU scalars of the metric value need to be reduced to a single scalar, and what I get every time is a sum reduction, while I would like to control that.
Keep in mind that the reduction parameter belongs to Loss in Keras; there is no such thing in the Metric class: https://github.com/tensorflow/tensorflow/blob/acbc065f8eb2ed05c7ab5c42b5c5bd6abdd2f91f/tensorflow/python/keras/metrics.py#L87
(The only crazy thing I tried was to inherit from the Mean class, which is a subclass of Metric, but that didn't change anything.)
reduction is mentioned in the metrics code; however, that is a reduction over multiple values accumulated in a single metric object. In a multi-GPU setting this is not the case: every metric works on its own GPU and is somehow aggregated at the end.
The way I debugged this behaviour was to print the shapes and results inside the metric's update_state method, and then look at the metric's value in the logs object in the on_batch_end callback.
I tried looking at the TF code, but couldn't find the place where this happens.
I would like to be able to control this behaviour: either pick 'mean' or 'sum' for the metric, or at least know where in the code it is done.
Edited: I guess this https://github.com/tensorflow/tensorflow/issues/39268 sheds some more light on this issue

I am facing the same problem as you (and that's why I found your question).
Seeing that it's been 15 days since you asked the question and there are no answers/comments yet, I thought I might share my temporary workaround.
Like you, I also think that a SUM reduction is being performed when combining results over multiple GPUs. What I did was pass the number of GPUs (e.g. given by the num_replicas_in_sync attribute of your tf.distribute strategy object) into the __init__(...) constructor of the sub-classed metric object, and use it to divide the return value in the result() method.
Potentially, you could also use tf.distribute.get_strategy() from within the metric object to make it "strategy aware", and use the information to decide how to modify the values in an ad hoc manner so that the SUM reduction will produce what you want.
I hope this helps for now, whether as a suggestion or as a confirmation that you're not alone on this.
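For concreteness, here is a minimal sketch of that workaround; the metric computation inside update_state is just a stand-in, and the only point is the division by the replica count in result():

import tensorflow as tf

class SumCorrectedMetric(tf.keras.metrics.Metric):
    """Hypothetical metric that pre-divides by the replica count, so the
    SUM reduction across GPUs ends up producing a mean."""

    def __init__(self, num_replicas, name="sum_corrected_metric", **kwargs):
        super().__init__(name=name, **kwargs)
        self.num_replicas = num_replicas  # e.g. strategy.num_replicas_in_sync
        self.total = self.add_weight(name="total", initializer="zeros")
        self.count = self.add_weight(name="count", initializer="zeros")

    def update_state(self, y_true, y_pred, sample_weight=None):
        # Stand-in computation; replace with your actual metric logic.
        value = tf.reduce_mean(tf.abs(tf.cast(y_true, y_pred.dtype) - y_pred))
        self.total.assign_add(value)
        self.count.assign_add(1.0)

    def result(self):
        # Pre-divide so the implicit SUM over replicas becomes a mean.
        return self.total / self.count / self.num_replicas

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    metric = SumCorrectedMetric(strategy.num_replicas_in_sync)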

When implementing a subclass of the Keras Metric class, you have to override the merge_state() method correctly. If you do not override this method, the default implementation will be used, which is a simple sum.
See: https://www.tensorflow.org/api_docs/python/tf/keras/metrics/Metric
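For illustration, a minimal sketch of such an override, assuming you want an average rather than the default sum (the metric body itself is a placeholder):

import tensorflow as tf

class AveragedMetric(tf.keras.metrics.Metric):
    def __init__(self, name="averaged_metric", **kwargs):
        super().__init__(name=name, **kwargs)
        self.value = self.add_weight(name="value", initializer="zeros")

    def update_state(self, y_true, y_pred, sample_weight=None):
        # Placeholder computation; replace with your actual metric logic.
        self.value.assign(tf.reduce_mean(tf.abs(tf.cast(y_true, y_pred.dtype) - y_pred)))

    def result(self):
        return self.value

    def merge_state(self, metrics):
        # The default merge_state sums the state variables of the other
        # metric instances into this one; take the mean instead.
        values = [self.value] + [m.value for m in metrics]
        self.value.assign(tf.add_n(values) / float(len(values)))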

Related

Calculate average and class-wise precision/recall for multiple classes in TensorFlow

I have a multiclass model with 4 classes. I have already implemented a callback able to calculate the precision/recall for each class and their macro average. But for some technical reason, I have to calculate them using the metrics mechanism.
I'm using TensorFlow 2 and Keras 2.3.0. I have already used the tensorflow.keras.metrics.Recall/Precision to get the class-wise metrics:
from tensorflow.keras.metrics import Precision, Recall

metrics_list = ['accuracy']
metrics_list.extend([Recall(class_id=i, name="recall_{}".format(label_names[i])) for i in range(n_category)])
metrics_list.extend([Precision(class_id=i, name="precision_{}".format(label_names[i])) for i in range(n_category)])
model = Model(...)
model.compile(..., metrics=metrics_list)
However, this solution is not satisfying:
Firstly, tensorflow.keras.metrics.Recall/Precision uses a threshold to decide class affiliation, while it should use argmax to pick the most probable class when class_id is defined.
Secondly, I have to create 2 new metrics that calculate the average over all classes, which itself requires calculating the class-wise metrics; computing the same thing twice is inelegant and inefficient.
Is there a way to create a class or a function that would directly calculate the class-wise and the average precision/recall using the TensorFlow/Keras metrics logic?
Apparently I can easily obtain the confusion matrix using tf.math.confusion_matrix(). However, I do not see how a metric could emit a list of scalars at once instead of returning a single scalar.
Any comment is welcome!
It turns out that in my very specific case I can simply use CategoricalAccuracy() as the only metric, because I'm using batch_size=1. In this case, accuracy = recall = precision = {1.|0.} for a batch. That only partially solves the problem. The best solution would be to update a confusion matrix using argmax at each batch end, then calculate the precision/recall based on it. I don't know how to do that yet, but it should be doable; a sketch follows below.
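As a sketch of that idea (the name MacroRecall is my own; this assumes one-hot y_true and per-class scores in y_pred):

import tensorflow as tf

class MacroRecall(tf.keras.metrics.Metric):
    """Accumulates an argmax-based confusion matrix and derives macro recall."""

    def __init__(self, n_category, name="macro_recall", **kwargs):
        super().__init__(name=name, **kwargs)
        self.n_category = n_category
        self.cm = self.add_weight(
            name="confusion_matrix",
            shape=(n_category, n_category),
            initializer="zeros")

    def update_state(self, y_true, y_pred, sample_weight=None):
        labels = tf.argmax(y_true, axis=-1)
        preds = tf.argmax(y_pred, axis=-1)
        batch_cm = tf.cast(
            tf.math.confusion_matrix(labels, preds, num_classes=self.n_category),
            self.dtype)
        self.cm.assign_add(batch_cm)

    def result(self):
        diag = tf.linalg.diag_part(self.cm)
        true_per_class = tf.reduce_sum(self.cm, axis=1)  # row sums = true counts
        per_class_recall = tf.math.divide_no_nan(diag, true_per_class)
        return tf.reduce_mean(per_class_recall)

    def reset_state(self):  # reset_states in older TF versions
        self.cm.assign(tf.zeros_like(self.cm))

A matching macro precision would read the column sums (tf.reduce_sum(self.cm, axis=0)) instead.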

Does TensorFlow gradient compute derivative of functions with unknown dependency on decision variable

I would appreciate it if you could answer my questions or provide me with useful resources.
Currently, I am working on a problem that requires alternating optimization. Consider two decision variables, x and y. In the first step I take the derivative of the loss function wrt. x (for fixed y) and update x. In the second step I need to take the derivative wrt. y. The issue is that x depends on y implicitly, and finding a closed form of the cost function that shows the dependency of x on y is not feasible, so the gradients of the cost function wrt. y are unknown.
1) My first question is whether the reverse-mode "autodiff" method used in TensorFlow works for such problems, where we do not have an explicit form of the cost function wrt one variable but need its derivatives. The value of the cost function is known, but its dependency on the decision variable is not known in closed form.
2) From a general view: if I define a node as a tf.Variable and have an arbitrary intractable function (intractable to compute by hand) of that variable that evolves through code execution, is it possible to calculate the gradients via tf.gradients? If yes, how can I make sure it is implemented correctly? Can I check it using TensorBoard?
My model is too complicated, but a simplified form can be considered this way: suppose the loss function for my model is L(x). I can code L(x) as a function of x during the construction phase in TensorFlow. However, I also have another variable k that is initialized to zero. The dependency of L(x) on k takes shape as the code runs, so my loss function is actually L(x, k). More importantly, x is implicitly a function of k. (All the optimization is done using GradientDescent.) The problem is that I do not have L(x, k) as a closed-form function, but I do have the value of L(x, k) at each step. I could use numerical methods like FDSA/SPSA, but they are not exact. I just need to make sure, as you said, that there is a path between k and L(x, k), but I do not know how!
TensorFlow gradients only work when the graph connecting x and y (when you're computing dy/dx) has at least one path that contains only differentiable operations. In general, if TF gives you a gradient it is correct (otherwise file a bug, but gradient bugs are rare, since the gradient of every differentiable op is well tested and the chain rule is fairly easy to apply).
Can you be a little more specific about what your model looks like? You might also want to use eager execution if your forward computation is too weird to express as a fixed dataflow graph.
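To illustrate the "path of differentiable ops" point, a quick eager-mode check (using tf.GradientTape from TF 2; tf.stop_gradient here just simulates a severed dependency):

import tensorflow as tf

x = tf.Variable(3.0)
with tf.GradientTape(persistent=True) as tape:
    y = x * x                    # differentiable path: dy/dx = 2x
    z = tf.stop_gradient(x * x)  # dependency on x deliberately severed

print(tape.gradient(y, x))  # tf.Tensor(6.0, ...) -- a path exists
print(tape.gradient(z, x))  # None -- no differentiable path back to x
del tape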

Pytorch register_hook to Keras implementation

I'm trying to implement the following project in TensorFlow/Keras.
https://github.com/jacobgil/pytorch-pruning
I'm having a hard time understanding what register_hook does. It can be found in finetune.py, line 66:
x.register_hook(self.compute_rank)
I've searched for clear explanations of this function and tried to find Keras equivalents, without any luck. Do you have answers to these questions?
First things first, here's the documentation:
http://pytorch.org/docs/master/autograd.html#torch.autograd.Variable.register_hook
This allows you to register a method on a Variable that is called whenever the Variable's .grad is updated, i.e. in a backward pass; it takes the grad as input. The method can return a Variable to replace the original .grad, or None if you just want to read the gradients and do something else with them.
If you update the gradients this way, the nodes further down in the compute graph see the new updated gradient in the backward pass and will have their respective gradients calculated with the updated value.
I'm not a TensorFlow expert, but the tf.RegisterGradient decorator (documentation) seems to be able to do the same; for an example, see this answer.
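A minimal PyTorch illustration of the behaviour described above (the doubling is arbitrary, just to show the replacement):

import torch

v = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

def double_grad(grad):
    # Called during backward with v's gradient; returning a tensor
    # replaces .grad, returning None merely observes it.
    return grad * 2

v.register_hook(double_grad)
loss = (v * v).sum()   # d(loss)/dv = 2v
loss.backward()
print(v.grad)          # tensor([ 4.,  8., 12.]) -- doubled by the hook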

Machine learning: why does the cost function not need to be differentiable?

I was playing around with TensorFlow, creating a customized loss function, and this question about machine learning in general came to mind.
My understanding is that the optimization algorithm needs a differentiable cost function to find/approach a minimum; however, we can use functions that are non-differentiable, such as the absolute value function (there is no derivative at x = 0). As a more extreme example, I defined my cost function like this:
import tensorflow as tf

def customLossFun(x, y):
    return tf.sign(x)
and I expected an error when running the code, but it actually worked (it didn't learn anything but it didn't crash).
Am I missing something?
You're missing the fact that the gradient of the sign function is manually defined somewhere in the TensorFlow source code.
As you can see here:
def _SignGrad(op, _):
  """Returns 0."""
  x = op.inputs[0]
  return array_ops.zeros(array_ops.shape(x), dtype=x.dtype)
the gradient of tf.sign is defined to always be zero. This, of course, is the gradient where the derivative exists, hence everywhere except at zero.
The TensorFlow authors decided not to check whether the input is zero and throw an exception in that specific case.
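You can verify the registered gradient directly; a quick eager-mode check (TF 2 syntax):

import tensorflow as tf

for x0 in (-2.0, 0.0, 2.0):
    x = tf.Variable(x0)
    with tf.GradientTape() as tape:
        y = tf.sign(x)
    print(x0, tape.gradient(y, x).numpy())  # 0.0 in all three cases, including x = 0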
In order to prevent TensorFlow from throwing an error, the only real requirement is that your cost function evaluates to a number for any value of your input variables. From a purely "will it run" perspective, it doesn't know/care about the form of the function it's trying to minimize.
In order for your cost function to provide you a meaningful result when TensorFlow uses it to train a model, it additionally needs to 1) get smaller as your model does better and 2) be bounded from below (i.e. it can't go to negative infinity). It's not generally necessary for it to be smooth (e.g. abs(x) has a kink where the sign flips). Tensorflow is always able to compute gradients at any location using automatic differentiation (https://en.wikipedia.org/wiki/Automatic_differentiation, https://www.tensorflow.org/versions/r0.12/api_docs/python/train/gradient_computation).
Of course, those gradients are of more use if you've chosen a meaningful cost function that isn't too flat.
Ideally, the cost function needs to be smooth everywhere to apply gradient-based optimization methods (SGD, Momentum, Adam, etc.). But nothing's going to crash if it's not; you may just have issues with convergence to a local minimum.
When the function is non-differentiable at a certain point x, it's possible to get large oscillations if the neural network converges to this x. E.g., if the loss function is tf.abs(x), it's possible that the network weights are mostly positive, so that x > 0 at all times and the network never notices the kink in tf.abs. However, it's more likely that x will bounce around 0, so that the gradient is arbitrarily positive or negative. If the learning rate is not decaying, the optimization won't converge to the local minimum, but will bounce around it.
In your particular case, the gradient is zero all the time, so nothing's going to change at all.
If it didn't learn anything, what have you gained? Your loss function is differentiable almost everywhere, but it is flat almost everywhere, so the minimizer can't figure out the direction towards the minimum.
If you start out with a positive value, it will most likely be stuck at a random value on the positive side, even though the minima on the left side are better (have a lower value).
Tensorflow can be used to do calculations in general and it provides a mechanism to automatically find the derivative of a given expression and can do so across different compute platforms (CPU, GPU) and distributed over multiple GPUs and servers if needed.
But what you implement in Tensorflow does not necessarily have to be a goal function to be minimized. You could use it e.g. to throw random numbers and perform Monte Carlo integration of a given function.
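For instance, a tiny sketch of that Monte Carlo idea, with no loss function or training involved:

import tensorflow as tf

# Estimate the integral of x^2 over [0, 1], whose exact value is 1/3.
samples = tf.random.uniform(shape=(100000,))
estimate = tf.reduce_mean(samples ** 2)
print(estimate.numpy())  # ~0.333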

Taking the gradient in TensorFlow, tf.gradients

I am using this TensorFlow function to get the Jacobian of my function. I came across a few problems:
1. The TensorFlow documentation contradicts itself in the following two passages, if I am not mistaken:
"gradients() adds ops to the graph to output the partial derivatives of ys with respect to xs. It returns a list of Tensor of length len(xs) where each tensor is the sum(dy/dx) for y in ys."
versus
"Returns: A list of sum(dy/dx) for each x in xs."
According to my test, it does, in fact, return a vector of length len(ys), which is the sum(dy/dx) for each x in xs.
2. I do not understand why they designed it in a way that the return is the sum of the columns (or rows, depending on how you define your Jacobian).
3. How can I really get the Jacobian?
4. In the loss, I need the partial derivative of my function with respect to the input (x), but when I am optimizing with respect to the network weights, I define x as a placeholder whose value is fed later, while the weights are variables. In this case, can I still define the symbolic derivative of the function with respect to the input (x) and put it in the loss? (Which, when we later optimize with respect to the weights, will bring in a second-order derivative of the function.)
1. I think you are right, and there is a typo there; it was probably meant to be "of length len(ys)".
2. For efficiency. I can't explain the exact reasoning, but this seems to be a pretty fundamental characteristic of how TensorFlow handles automatic differentiation. See issue #675.
3. There is no straightforward way to get the Jacobian matrix in TensorFlow. Take a look at this answer and, again, issue #675. Basically, you need one call to tf.gradients per column/row; see the sketch after these points.
4. Yes, of course. You can compute whatever gradients you want; there is no real difference between a placeholder and any other operation, really. There are a few operations that do not have a gradient because it is not well defined or not implemented (in which case they will generally return 0), but that's all.
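For point 3, a sketch of that row-by-row approach in TF1-style graph mode (the toy function is my own):

import tensorflow as tf

tf.compat.v1.disable_eager_execution()

x = tf.compat.v1.placeholder(tf.float32, shape=(3,))
y = tf.stack([x[0] * x[1], x[1] + x[2]])  # toy function from R^3 to R^2

# One tf.gradients call per output component builds one Jacobian row each.
rows = [tf.gradients(y[i], x)[0] for i in range(2)]
jacobian = tf.stack(rows)  # shape (2, 3)

with tf.compat.v1.Session() as sess:
    print(sess.run(jacobian, feed_dict={x: [1.0, 2.0, 3.0]}))
    # [[2. 1. 0.]
    #  [0. 1. 1.]]

Note that the gradient here is taken with respect to a placeholder, which also illustrates point 4.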