How are function metrics aggregated over batches in TensorFlow model validation?

In TensorFlow's tf.keras.Model.compile, you can pass a function such as lambda y_true, y_pred: val as a metric (although this seems to be undocumented), which made me ask myself: "How is it aggregated over the batches?"
I searched the documentation but found no explanation of how this is done.
For that matter, I don't even know whether doing so is undefined behavior and one should instead subclass the Metric class (or at least provide the required methods).
Also, is it sensible to pass a loss as a metric (and in that case, the same question: how is it aggregated over the batches)?

To understand how it aggregates (I'm assuming for display in the progress bar), I suggest you check tf.keras.utils.Progbar. Aggregation over batches happens when you call model.fit, not model.compile.
Is using a lambda as a loss or metric undefined behaviour? No, as long as it is defined properly. If you do not write the lambda expression properly, TensorFlow will throw an exception.
Is using a lambda as a loss or metric recommended? No. There is a reason TensorFlow provides separate classes for these. Extending the built-in classes simplifies other parts of the pipeline, such as saving and loading models. It also makes the code much more readable.
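For illustration, here is a minimal sketch contrasting the two approaches (the model and the metric logic are placeholders):
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])

# A bare function as a metric: Keras wraps it in a MeanMetricWrapper,
# so the displayed value is the running mean of the per-batch results.
model.compile(
    optimizer='adam',
    loss='mse',
    metrics=[lambda y_true, y_pred: tf.reduce_mean(tf.abs(y_true - y_pred))])

# The recommended alternative: a stateful tf.keras.metrics.Metric subclass,
# which also survives model saving and loading.
class MyMeanAbsoluteError(tf.keras.metrics.Metric):
    def __init__(self, name='my_mae', **kwargs):
        super().__init__(name=name, **kwargs)
        self.total = self.add_weight(name='total', initializer='zeros')
        self.count = self.add_weight(name='count', initializer='zeros')

    def update_state(self, y_true, y_pred, sample_weight=None):
        self.total.assign_add(tf.reduce_sum(tf.abs(y_true - y_pred)))
        self.count.assign_add(tf.cast(tf.size(y_true), tf.float32))

    def result(self):
        return self.total / self.count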

It should just take the average over batches. I don't think it's undefined behavior.

Check out the "Creating Custom Metrics" section here. The metric you use (the lambda) is a stateless, and therefore, during training, it's
the average of the per-batch metric values for all batches seen during a given epoch.

Related

How to initialize the model with certain weights?

I am using the "stateful_clients" example from the tensorflow-federated examples. I want to use my pretrained model weights to initialize the model, so I call model.load_weights(init_weight), but it doesn't seem to work: the validation accuracy in the first round is still low. How can I solve this problem?
def tff_model_fn():
    """Constructs a fully initialized model for use in federated averaging."""
    keras_model = get_five_layers_cnn([28, 28, 1])
    keras_model.load_weights(init_weight)
    loss = tf.keras.losses.SparseCategoricalCrossentropy()
    return stateful_fedavg_tf.KerasModelWrapper(keras_model,
                                                test_data.element_spec, loss)
A quick primer on state and model weights in TFF
TFF takes a distinct perspective on state in machine learning, generally a consequence of its desire to be purely functional.
Usually in machine learning, a model is conceptually a function which takes data and produces a prediction. However, this notion is a little overloaded at times; does 'model' refer to a trained model (fitting the specification above), or an architecture which is parameterized by its parameters, and therefore needs to accept these parameters as an argument to be considered truly a 'function'? A conception somewhat in the middle is that of a 'stateful function', which I think tends to be what people intend to refer to when they use the term 'model'.
TFF standardizes on the second of these understandings: for TFF, a 'model' is a function which accepts parameters along with data as arguments and produces a prediction. This is largely to avoid the notion of a stateful function, which is disallowed by a purely functional perspective (f(x) == f(x) should always be true, so f cannot have any state which affects its output).
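In code, the purely functional view looks roughly like this (illustrative only, not actual TFF API):
# A 'model' in the functional sense: the parameters are an explicit
# argument, so calling it twice with the same arguments always gives
# the same result.
def predict(params, x):
    w, b = params
    return x @ w + b
A stateful object-oriented model, by contrast, hides its parameters inside itself, so its output can change as that internal state is updated.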
On the code in question
I'm not super familiar with this portion of the TFF codebase; in particular, I'm a little surprised at the behavior of the Keras model wrapper, as usually TFF wants to serialize all logic into TFF-defined data structures as soon as possible (at least, this is how I think about it). Glancing at the code, it looks to me like it could work, but there have been exciting interactions between TFF and Keras in the past.
Briefly, here is how this path should be working:
The model function you define above is invoked while building the initialize computation, in a graph context; the logic to load weights (or assignment of the weights themselves, baked into the graph as a constant) would hopefully be serialized into the graph that TFF generates to represent initialize.
Upon calling iterative_process.initialize, you would find your desired weights populated in the appropriate attributes of the returned data structure. This would serve as your initial starting point for your iterative process, and you would be off to the races.
What I am suspicious of in the above is this: TFF will silently invoke your model_fn in a TensorFlow graph context, resulting in non-program-order semantics; if there is no control dependency between the assignment and the return value of your function (there isn't one in the code above, and in fact it is not obvious how to force one), the assignment may be skipped at initialize time. The state returned from initialize would then not have your specified weights.
If this suspicion is correct, the appropriate solution is to run the weight-loading logic directly in Python. TFF provides some utilities to help with this kind of thing, like tff.learning.state_with_new_model_weights, which would be used along these lines (the exact weight extraction may need adjusting):
state = iterative_process.initialize()
keras_model = get_five_layers_cnn([28, 28, 1])
keras_model.load_weights(init_weight)
# state_with_new_model_weights expects the trainable and non-trainable
# weights as separate arguments of numpy arrays.
state_with_loaded_weights = tff.learning.state_with_new_model_weights(
    state,
    trainable_weights=[v.numpy() for v in keras_model.trainable_weights],
    non_trainable_weights=[v.numpy() for v in keras_model.non_trainable_weights])
# continue on, using state_with_loaded_weights in the iterative process

How to get the value of loss in the update rule of chainer

I am trying to modify Chainer's SGDRule(optimizer.UpdateRule) class to make my own optimizer.
To achieve what I want, I need access not only to the gradient but also to the loss.
Before the gradient is generated by backpropagation, a forward pass, which yields the loss, must be done. I need that loss.
The problem is that I have to access the loss from the code of update_core_gpu(self, param) in the class.
I learned that the Classifier object holds the loss as an attribute, but I don't know how to access that object from the update rule.
As an alternative, I considered using the Reporter object, which I can access from my code. I know how to pass a value to the reporter, but I have no idea how to get the loss that the reporter holds.
Does anybody know how to get the current loss in the code of the update rule?
If you are using a model that holds the loss, e.g. a Classifier, one simple but perhaps less elegant way would be to pass the model to the Optimizer, and from there to each UpdateRule as it is constructed in Optimizer.create_update_rule. If you don't want to pass the model itself, you could pass a lambda that returns the loss from the model.
Another, probably cleaner, approach, if it is sufficient for your case, would be to implement an optimizer hook, similar to how gradient clipping is implemented in Chainer; see https://github.com/chainer/chainer/blob/master/chainer/optimizer_hooks/gradient_clipping.py#L56. You can obtain the loss via opt.target.loss (opt.target being your model) and, for instance, adjust the gradients prior to the optimization step.
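A rough sketch of such a hook (the name and the gradient manipulation are illustrative; the attributes follow Chainer's optimizer-hook protocol):
class LossAwareHook(object):
    """Illustrative optimizer hook that reads the loss from opt.target."""
    name = 'LossAwareHook'
    call_for_each_param = False
    timing = 'pre'  # run before the parameter update

    def __call__(self, opt):
        # opt.target is the link passed to optimizer.setup(), e.g. a
        # chainer.links.Classifier, which stores the last computed loss.
        loss_value = float(opt.target.loss.array)
        for param in opt.target.params():
            if param.grad is not None:
                # Placeholder for whatever loss-dependent logic you need.
                param.grad *= 1.0
It would be registered with optimizer.add_hook(LossAwareHook()) after optimizer.setup(model).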

What does stateful mean in tensorflow metrics in my case?

I don't really understand the explanation of a stateful metric here: Keras metrics with TF backend vs tensorflow metrics
Now, if I split my evaluation data into batches and use tf.metrics.precision on each batch, does that mean the variables from previous batches (false-positive counts, etc.) are carried into the calculation for the next batch? That would be really bad, since I want a separate evaluation for each batch (that is why I split the data!).
If this is the case, how can I reset the variables for each batch?
I need the individual values from each batch to compute a mean afterwards.
The reason tf.metrics.Precision and the like (Recall, etc.) store true/false positives is that we do not want to estimate them batch-wise (unlike Accuracy or Loss, etc.). The original implementation of Precision in Keras (note: not tf.keras) did exactly what you describe (a single evaluation for each batch, aggregated afterward), but it was removed in version 2.0.0 because this way of computing a global metric is "more misleading than helpful" (https://github.com/keras-team/keras/issues/5794).
But you may still do what you want to do: you can subclass tf.metrics.Metric and implement the logic of Precision in the update_state method. The Metric API docs on TensorFlow have an example of custom metrics: https://www.tensorflow.org/api_docs/python/tf/keras/metrics/Metric
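Alternatively, if all you need is the per-batch value, you can reset the built-in metric between batches, along these lines (model and eval_dataset are assumed to exist):
import tensorflow as tf

precision = tf.keras.metrics.Precision()
per_batch = []

for x_batch, y_batch in eval_dataset:   # assumed to be a tf.data.Dataset
    y_pred = model(x_batch, training=False)
    precision.reset_states()            # clear the accumulated TP/FP counts
    precision.update_state(y_batch, y_pred)
    per_batch.append(float(precision.result()))

mean_of_batch_precisions = sum(per_batch) / len(per_batch)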
I hope this is helpful!

Is it a TensorFlow best practice for loss functions to be callable (in the form of a function)? Are there other advantages besides Eager Execution compatibility?

Eager Execution requires any loss passed to an optimizer to be callable, i.e., in the form of a function.
So this is OK
def loss_function():
    return tf.reduce_mean(
        tf.nn.sampled_softmax_loss(
            weights=softmax_weights, biases=softmax_biases,
            inputs=averaged_embeds, labels=train_labels,
            num_sampled=num_sampled, num_classes=vocabulary_size))
but this is NOT ok
loss = tf.reduce_mean(
    tf.nn.sampled_softmax_loss(
        weights=softmax_weights, biases=softmax_biases,
        inputs=averaged_embeds, labels=train_labels,
        num_sampled=num_sampled, num_classes=vocabulary_size))
It will raise this error:
`loss` passed to Optimizer.compute_gradients should be a function when eager execution is enabled.
I've noticed that when a new TensorFlow feature requires some sort of practice, there are usually multiple benefits associated with that requirement, even when not using that feature. For example, Eager Execution also requires tf.get_variable to define variables, and as far as I can tell there is no reason to use tf.Variable over tf.get_variable.
So are there any other advantages of having a loss function be callable, outside of using Eager Execution?
On the loss issue (Akshay summed up the variable issue in the question comments):
When graph building, Tensors can behave like futures, and this is the way the Optimizer methods use them: the future can be evaluated many times in the training loop, and the evaluation monitored to produce the gradient (which works by adding a bunch of ops to the graph that look at intermediate values).
Tensors are literal values when executing eagerly, so to get the same behavior the loss needs to be wrapped in a function (which is then run with a tf.GradientTape active to trace the gradient).
As for whether it's a best practice: having new TensorFlow code be graph/eager agnostic is a good thing, although writing agnostic training loops is often more effort than it's worth. In general, using Tensors as futures can be a bit harder to reason about.
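A minimal sketch of what that wrapping amounts to when executing eagerly (the variable and loss are illustrative):
import tensorflow as tf

w = tf.Variable([1.0, 2.0])

def loss_fn():  # the callable form the optimizer asks for
    return tf.reduce_sum(tf.square(w))

# The function is re-run under a GradientTape each step, so the
# intermediate values can be traced to produce the gradient.
with tf.GradientTape() as tape:
    loss = loss_fn()
grads = tape.gradient(loss, [w])
tf.optimizers.SGD(learning_rate=0.1).apply_gradients(zip(grads, [w]))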

What is the difference between tf.gradients and tf.train.Optimizer.compute_gradients?

It seems that tf.gradients also allows computing Jacobians, i.e., the partial derivatives of each entry of one tensor with respect to each entry of another tensor, while tf.train.Optimizer.compute_gradients only computes actual gradients, e.g., the partial derivatives of a scalar value with respect to each entry of a particular tensor or with respect to one particular scalar. Why is there a separate function if tf.gradients also implements that functionality?
tf.gradients does not allow you to compute Jacobians: it aggregates the gradients of every output with respect to each input (something like the summation of each column of the actual Jacobian matrix). In fact, there is no "good" way of computing Jacobians in TensorFlow (basically, you have to call tf.gradients once per output; see this issue).
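For example, building a full Jacobian in (1.x-style) graph mode ends up being one tf.gradients call per output entry, roughly:
import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[3])
y = x ** 2  # three outputs

# One tf.gradients call per output entry; stack the resulting rows.
jacobian = tf.stack([tf.gradients(y[i], x)[0] for i in range(3)])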
With respect to tf.train.Optimizer.compute_gradients: yes, its result is basically the same, but it takes care of some details automatically and has a slightly more convenient output format. If you look at the implementation, you will see that, at its core, it is a call to tf.gradients (in this case aliased to gradients.gradients), but the surrounding logic is useful for optimizer implementations to have already in place. Also, having it as a method allows for extensible behaviour in subclasses, either to implement some kind of optimization strategy (not very likely at the compute_gradients step, really) or for auxiliary purposes, like tracing or debugging.
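The relationship between the two looks roughly like this in 1.x-style code (illustrative):
import tensorflow as tf

x = tf.Variable([1.0, 2.0])
loss = tf.reduce_sum(x ** 2)

# Direct call: a list with one gradient tensor per variable.
grads = tf.gradients(loss, [x])

# Optimizer method: a list of (gradient, variable) pairs, built
# internally on top of tf.gradients.
opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)
grads_and_vars = opt.compute_gradients(loss, var_list=[x])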