Calculate average and class-wise precision/recall for multiple classes in TensorFlow - tensorflow

I have a multiclass model with 4 classes. I have already implemented a callback able to calculate the precision/recall for each class and their macro average. But for some technical reason, I have to calculate them using the metrics mechanism.
I'm using TensorFlow 2 and Keras 2.3.0. I have already used the tensorflow.keras.metrics.Recall/Precision to get the class-wise metrics:
metrics_list = ['accuracy']
metrics_list.extend([Recall(class_id=i, name="recall_{}".format(label_names[i])) for i in range(n_category)])
metrics_list.extend([Precision(class_id=i, name="precision_{}".format(label_names[i])) for i in range(n_category)])
model = Model(...)
model.compile(...metrics=metrics_list)
However, this solution is not satisfying:
Firstly, tensorflow.keras.metrics.Recall/Precision uses a threshold to decide membership in a class, whereas it should use argmax to pick the most probable class when class_id is defined.
Secondly, I have to create 2 new metrics that calculate the average over all classes, which itself requires calculating the class-wise metrics. Calculating the same thing twice is inelegant and inefficient.
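The first point can be illustrated without any framework. This is a minimal sketch, assuming the default threshold of 0.5 that the Keras Precision/Recall metrics apply when none is given; the two helper functions are hypothetical and only mimic the two decision rules:

```python
def predicted_by_threshold(probs, class_id, threshold=0.5):
    """Class membership as a thresholded metric sees it."""
    return probs[class_id] > threshold


def predicted_by_argmax(probs, class_id):
    """Class membership as a single-label classifier intends it."""
    return max(range(len(probs)), key=probs.__getitem__) == class_id


probs = [0.4, 0.3, 0.2, 0.1]  # softmax output for one sample, 4 classes
by_threshold = [predicted_by_threshold(probs, c) for c in range(4)]
by_argmax = [predicted_by_argmax(probs, c) for c in range(4)]
```

With these probabilities no class passes the threshold, although argmax clearly selects class 0, so the thresholded metrics would count no positive prediction for this sample.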
Is there a way to create a class or a function that would directly calculate the class-wise and the average precision/recall using the TensorFlow/Keras metrics logic?
Apparently I can easily obtain the confusion matrix using tf.math.confusion_matrix(). However, I do not see how a metric could report a list of scalars at once instead of returning a single scalar.
Any comment is welcomed!

It turns out that in my very specific case, I can simply use CategoricalAccuracy() as the only metric because I'm using batch_size=1. In this case, accuracy=recall=precision={1.|0.} for a batch. That only partially solves the problem. The best solution would be to update the confusion matrix using argmax at the end of each batch, then calculate the precision/recall from it. I don't know how to do that yet, but it should be doable.
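The idea of updating a confusion matrix with argmax after each batch can be sketched in plain Python. This is only an illustration of the accumulation logic, not a Keras metric; the class name is hypothetical, and the method names merely mirror update_state/result:

```python
class ArgmaxConfusionMatrix:
    """Accumulates a confusion matrix batch by batch (rows: true, cols: predicted)."""

    def __init__(self, n_classes):
        self.n_classes = n_classes
        self.matrix = [[0] * n_classes for _ in range(n_classes)]

    def update_state(self, y_true, y_prob):
        # y_true: integer labels; y_prob: one list of class probabilities per sample.
        for label, probs in zip(y_true, y_prob):
            pred = max(range(self.n_classes), key=probs.__getitem__)  # argmax
            self.matrix[label][pred] += 1

    def result(self):
        precision, recall = [], []
        for c in range(self.n_classes):
            tp = self.matrix[c][c]
            predicted = sum(row[c] for row in self.matrix)  # column sum
            actual = sum(self.matrix[c])                    # row sum
            precision.append(tp / predicted if predicted else 0.0)
            recall.append(tp / actual if actual else 0.0)
        return {
            "precision": precision,
            "recall": recall,
            "macro_precision": sum(precision) / self.n_classes,
            "macro_recall": sum(recall) / self.n_classes,
        }


cm = ArgmaxConfusionMatrix(n_classes=3)
cm.update_state([0, 1], [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]])  # batch 1
cm.update_state([1, 2], [[0.1, 0.8, 0.1], [0.2, 0.2, 0.6]])  # batch 2
stats = cm.result()
```

In a Keras Metric subclass, the matrix would presumably be a variable created with add_weight and updated from tf.math.confusion_matrix applied to tf.argmax of the predictions. The open question of how to report several scalars from one metric remains.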

Related

How to control reduction strategy for stateful metric in keras mirrored strategy

I use keras fit() method with custom metrics passed to model.
The metrics are stateful - i.e. are a subclass of a Metric, as described in https://keras.io/api/metrics/#as-subclasses-of-metric-stateful
When I run the code in a multi-gpu environment using a tf.distribute.MirroredStrategy() my metric code is called on every GPU separately with batch_size/no_of_gpus examples passed, which is reasonable to expect.
What happens next is that multiple scalars (one from every GPU) of the metric value need to be reduced to a single scalar, and what I get all the time is a sum reduction, while I would like to control that.
Keep in mind that the reduction parameter belongs to Loss in Keras, and there is no such thing in the Metric class: https://github.com/tensorflow/tensorflow/blob/acbc065f8eb2ed05c7ab5c42b5c5bd6abdd2f91f/tensorflow/python/keras/metrics.py#L87
(the only crazy thing I tried was to inherit from a Mean class that is a subclass of a Metric but that didn't change anything)
reduction is mentioned in the metrics code, however this is a reduction over multiple accumulated values in a single metric object, and in multi-gpu setting - this is not the case, as every metric works in its own GPU and is somehow aggregated at the end.
The way I debugged it to understand this behaviour was - I was printing the shapes and the results inside update_state method of the metric. And then I looked at value of the metric in logs object in on_batch_end callback.
I tried looking at TF code, but couldn't find the place this is happening.
I would like to be able to control this behaviour - so either pick 'mean' or 'sum' for the metric, or at least know where it is being done in the code.
Edited: I guess this https://github.com/tensorflow/tensorflow/issues/39268 sheds some more light on this issue
I am facing the same problem as you (and that's why I found your question).
Seeing that it's been 15 days since you asked the question and there are no answers/comments yet, I thought I might share my temporary workaround.
Like you, I also think that a SUM reduction is performed when combining progress over multiple GPUs. What I did was pass the number of GPUs (e.g. given by the num_replicas_in_sync attribute of your tf.distribute strategy object) into the __init__(...) constructor of the sub-classed metric object, and use it to divide the return value in the result() method.
Potentially, you could also use tf.distribute.get_strategy() from within the metric object to make it "strategy aware", and use the information to decide how to modify the values in an ad hoc manner so that the SUM reduction will produce what you want.
I hope this helps for now, whether as a suggestion or as a confirmation that you're not alone on this.
When implementing the subclass of the Keras Metric class, you have to override the merge_state() function correctly. If you do not override this function, the default implementation will be used - which is a simple sum.
See: https://www.tensorflow.org/api_docs/python/tf/keras/metrics/Metric
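To see why a plain sum over replicas can still give the right answer, it helps to keep the metric state as quantities that are valid to add. This is a framework-free sketch under that assumption; MeanState is a hypothetical class, and this is not how tf.distribute is implemented internally:

```python
class MeanState:
    """State of a mean metric kept as two sums, so that adding states is valid."""

    def __init__(self):
        self.total = 0.0
        self.count = 0.0

    def update_state(self, values):
        self.total += sum(values)
        self.count += len(values)

    def merge_state(self, others):
        # Element-wise sum of the state variables across replicas.
        for other in others:
            self.total += other.total
            self.count += other.count

    def result(self):
        return self.total / self.count if self.count else 0.0


replica_a, replica_b = MeanState(), MeanState()
replica_a.update_state([1.0, 2.0, 3.0])  # this replica's own mean is 2.0
replica_b.update_state([5.0])            # this replica's own mean is 5.0

summed_results = replica_a.result() + replica_b.result()  # summing results
merged = MeanState()
merged.merge_state([replica_a, replica_b])                # summing states
merged_mean = merged.result()
```

Summing the per-replica results gives 7.0, which is not a mean of anything, while summing the states first and dividing afterwards gives the mean over all four examples.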

How to provide the Gekko Python with the first and second derivatives of the objective function?

I am trying to minimize the difference of a function with a data point over different time points. So the objective function is the sum of the squares of the difference between the model (my function) and the data points over different times.
My model has analytical first and second order derivatives. How can I provide these derivatives to Gekko Python?
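For reference, the quantities in question can be written out as follows. The notation is an assumption made here for illustration: f(t; θ) is the model, y_i the data point at time t_i, and r_i the residual. How Gekko would accept such derivatives is not addressed by this sketch.

```latex
\begin{aligned}
J(\theta) &= \sum_{i=1}^{N} r_i(\theta)^2,
  \qquad r_i(\theta) = f(t_i;\theta) - y_i \\
\frac{\partial J}{\partial \theta_j}
  &= 2 \sum_{i=1}^{N} r_i(\theta)\,
     \frac{\partial f(t_i;\theta)}{\partial \theta_j} \\
\frac{\partial^2 J}{\partial \theta_j \,\partial \theta_k}
  &= 2 \sum_{i=1}^{N} \left[
     \frac{\partial f(t_i;\theta)}{\partial \theta_j}
     \frac{\partial f(t_i;\theta)}{\partial \theta_k}
     + r_i(\theta)\,
     \frac{\partial^2 f(t_i;\theta)}{\partial \theta_j \,\partial \theta_k}
     \right]
\end{aligned}
```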
There are several parameter estimation examples on the APMonitor website. Please check the link below. It also provides the data and model that you can use for practice.
TCLab C - Parameter Estimation
The link below also shows how to implement higher-order differential equations in GEKKO. You basically introduce an additional variable that links the first-derivative variable to the second-derivative variable. That way, you can collapse the higher-order DE into multiple first-order DEs.
Solve 2nd Order Differential Equation
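The reduction to first-order equations can be sketched without Gekko. This is a plain-Python illustration that assumes the test equation y'' = -y and a simple explicit Euler integrator; neither comes from the linked examples:

```python
import math


def integrate_second_order(y0, v0, t_end, dt=1e-3):
    """Integrate y'' = -y by rewriting it as y' = v, v' = -y (explicit Euler)."""
    y, v = y0, v0
    for _ in range(round(t_end / dt)):
        # Both right-hand sides use the values from the start of the step.
        y, v = y + dt * v, v + dt * (-y)
    return y, v


y_end, v_end = integrate_second_order(y0=1.0, v0=0.0, t_end=1.0)
exact = math.cos(1.0)  # analytic solution y(t) = cos(t) for these initial values
```

The extra variable v plays the role of the first derivative, so the solver only ever sees first-order equations. In Gekko, the same pair of equations would presumably be stated with .dt() on each variable.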

Customized aggregation algorithm for gradient updates in tensorflow federated

I have been trying to implement this paper. Basically, what I want to do is sum the per-client loss and compare it with the previous epoch's. Then, for each constituent layer of the model, I want to compare the KL divergence between the weights of the server and the client model to get the layer-specific parameter updates, apply a softmax, and decide whether an adaptive update or a normal FedAvg approach is needed.
The algorithm is as follows-
FedMed
I tried to make use of the code here to build a custom federated avg process. I got the basic understanding that there are some tf.computations and some tff.computations which are involved. I get that I need to make changes in the orchestration logic in the run_one_round function and basically manipulate the client outputs to do adaptive averaging instead of the vanilla federated averaging. The client_update tf.computation function basically returns all the values that I need i.e the weights_delta (can be used for client based model weights), model_output(which can be used to calculate the loss).
But I am not sure where exactly I should make the changes.
@tff.federated_computation(federated_server_state_type,
                           federated_dataset_type)
def run_one_round(server_state, federated_dataset):
    server_message = tff.federated_map(server_message_fn, server_state)
    server_message_at_client = tff.federated_broadcast(server_message)
    client_outputs = tff.federated_map(
        client_update_fn, (federated_dataset, server_message_at_client))
    weight_denom = client_outputs.client_weight
    # TODO: instead of using tff.federated_mean, I wish to do an adaptive
    # aggregation based on client_outputs.weights_delta and the server_state model.
    round_model_delta = tff.federated_mean(
        client_outputs.weights_delta, weight=weight_denom)
    # client_outputs.weights_delta has all the client model weights.
    # client_outputs.client_weight has the number of examples per client.
    # client_outputs.model_output has the output of the model per client example.
I want to make use of the server model weights using server_state object.
I want to calculate the KL divergence between the weights of server model and each client's model per layer. Then use a relative weight to aggregate the client weights instead of vanilla federated averaging.
Instead of using tff.federated_mean I wish to use a different strategy basically an adaptive one based on the algorithm above.
So I needed some suggestions on how to go about implementing this.
Basically what I want to do is :
1) Sum all the values of the client losses.
2) Calculate the KL divergence between each client and the server on a per-layer basis, and then determine whether to use adaptive optimization or FedAvg.
Also, is there a way to manipulate this value as a Python value, which would be helpful for debugging purposes? (I tried to use tf.print but that was not helpful either.) Thanks!
Simplest option: compute weights for mean on clients
If I read the algorithm above correctly, we need only compute some weights for a mean on-the-fly. tff.federated_mean accepts an optional CLIENTS-placed weight argument, so probably the simplest option here is to compute the desired weights on the clients and pass them in to the mean.
This would look something like (assuming the appropriate definitions of the variables used below, which we will comment on):
@tff.federated_computation(...)
def round_function(...):
    ...
    # We assume there is a tff.Computation training_fn that performs training,
    # and we're calling it here on the correct arguments.
    trained_clients = tff.federated_map(training_fn, clients_placed_arguments)
    # Next we assume there is a variable in-scope server_model,
    # representing the 'current global model'.
    global_model_at_clients = tff.federated_broadcast(server_model)
    # Here we assume a function compute_kl_divergence, which takes
    # two structures of tensors and computes the KL divergence
    # (as a scalar) between them. The two arguments here are clients-placed,
    # so the result will be as well.
    kl_div_at_clients = tff.federated_map(compute_kl_divergence,
                                          (global_model_at_clients, trained_clients))
    # Perhaps we wish to not use raw KL divergence as the weight, but rather
    # some function thereof; if so, we map a postprocessing function to
    # the computed divergences. The result will still be clients-placed.
    mean_weight = tff.federated_map(postprocess_divergence, kl_div_at_clients)
    # Now we simply use the computed weights in the mean.
    return tff.federated_mean(trained_clients, weight=mean_weight)
More flexible tool: tff.federated_reduce
TFF generally encourages algorithm developers to implement whatever they can 'in the aggregation', and as such exposes some highly customizable primitives like tff.federated_reduce, which allow you to run arbitrary TensorFlow "in the stream" between clients and server. If the above reading of the desired algorithm is incorrect and something more involved is needed, or you wish to flexibly experiment with totally different notions of aggregation (something TFF encourages and is designed to support), this may be the option for you.
In TFF's heuristic typing language, tff.federated_reduce has signature:
<{T}#CLIENTS, U, (<U, T> -> U)> -> U#SERVER
Meaning, federated_reduce takes a value of type T placed at the clients, a 'zero' in a reduction algebra of type U, and a function accepting a U and a T and producing a U, and applies this function 'in the stream' on the way between clients and server, producing a U placed at the server. The function (<U, T> -> U) will be applied to the partially accumulated value U and the 'next' element in the stream T (note however that TFF does not guarantee ordering of these values), returning another partially accumulated value U. The 'zero' should represent whatever 'partially accumulated' means over the empty set in your application; this will be the starting point of the reduction.
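The shape of such a reduction, and the consequence of the missing ordering guarantee, can be mimicked in plain Python with functools.reduce. This is only an analogy for the signature above, not TFF code; the accumulator here is a hypothetical (sum, count) pair:

```python
import functools
import random


def accumulate(partial, element):
    """(<U, T> -> U): fold one client value into the (sum, count) accumulator."""
    total, count = partial
    return (total + element, count + 1)


zero = (0.0, 0)                       # U: the accumulator over the empty set
client_values = [0.5, 1.5, 2.0, 4.0]  # T: one float per client

in_order = functools.reduce(accumulate, client_values, zero)

shuffled = client_values[:]
random.Random(0).shuffle(shuffled)    # simulate an arbitrary arrival order
out_of_order = functools.reduce(accumulate, shuffled, zero)
```

Because the accumulate function does not depend on the order in which elements arrive, both reductions end in the same accumulator. A reduction function used in the stream should have this property.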
Application to this problem
The components
Your reduction function needs access to two pieces of data: the global model state and the result of training on a given client. This maps quite nicely to the type T. In this application, we will have something like:
T = <server_model=server_model_type, trained_model=trained_model_type>
These two types are likely to be the same, but may not necessarily be so.
Your reduction function will accept the partial aggregate, your server model and your client-trained model, returning a new partial aggregate. Here we will start assuming the same reading of the algorithm as above, that of a weighted mean with particular weights. Generally, the easiest way to compute a mean is to keep two accumulators, one for numerator and one for denominator. This will affect the choice of zero and reduction function below.
Your zero should contain a structure of tensors with value 0 mapping to the weights of your model--this will be the numerator. This would be generated for you if you had an aggregation like tff.federated_sum (as TFF knows what the zero should be), but for this case you'll have to get your hands on such a tensor yourself. This shouldn't be too hard with tf.nest.map_structure and tf.zeros_like.
For the denominator, we will assume we just need a scalar. TFF and TF are much more flexible than this--you could keep a per-layer or per-parameter denominator if desired--but for simplicity we will assume that we just want to divide by a single float in the end.
Therefore our type U will be something like:
U = <numerator=server_model_type, denominator=tf.float32>
Finally we come to our reduction function. It will be more or less a different composition of the same pieces above; we will make slightly tighter assumptions about them here (in particular, that all the local functions are tff.tf_computations--a technical assumption, arguably a bug in TFF). Our reduction function will be along these lines (assuming appropriate type aliases):
@tff.tf_computation(U, T)
def reduction(partial_accumulate, next_element):
    kl_div = compute_kl_divergence(
        next_element.server_model, next_element.trained_model)
    weight = postprocess_divergence(kl_div)
    # If the model is a nested structure of tensors, apply this line
    # per tensor, e.g. with tf.nest.map_structure.
    new_numerator = partial_accumulate.numerator + weight * next_element.trained_model
    new_denominator = partial_accumulate.denominator + weight
    return collections.OrderedDict(
        numerator=new_numerator, denominator=new_denominator)
Putting them together
The basic outline of a round will be similar to the above, but we have put more computation 'in the stream', and consequently there will be less on the clients. We assume here the same variable definitions.
@tff.federated_computation(...)
def round_function(...):
    ...
    trained_clients = tff.federated_map(training_fn, clients_placed_arguments)
    global_model_at_clients = tff.federated_broadcast(server_model)
    # This zip I believe is not necessary, but it helps my mental model.
    reduction_arg = tff.federated_zip(
        collections.OrderedDict(server_model=global_model_at_clients,
                                trained_model=trained_clients))
    # We assume a zero as specified above.
    return tff.federated_reduce(reduction_arg,
                                zero,
                                reduction)
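The numerator/denominator bookkeeping, including the final division that turns the accumulator into a model, can be checked in plain Python before wiring it into TFF. This is a sketch under stated assumptions: models are dicts of lists, finalize and distance_weight are hypothetical helpers, and distance_weight is only a placeholder for the KL-based weight, not the rule from the paper:

```python
import functools


def reduction(partial, element, weight_fn):
    """Fold one client's (server_model, trained_model) pair into the accumulator."""
    weight = weight_fn(element["server_model"], element["trained_model"])
    numerator = {
        layer: [acc + weight * w
                for acc, w in zip(partial["numerator"][layer], values)]
        for layer, values in element["trained_model"].items()
    }
    return {"numerator": numerator,
            "denominator": partial["denominator"] + weight}


def finalize(accumulated):
    """Divide numerator by denominator to obtain the weighted mean model."""
    return {
        layer: [v / accumulated["denominator"] for v in values]
        for layer, values in accumulated["numerator"].items()
    }


def distance_weight(server, trained):
    # Placeholder weight: 1 / (1 + squared distance). NOT the paper's rule.
    dist = sum((s - t) ** 2
               for k in server for s, t in zip(server[k], trained[k]))
    return 1.0 / (1.0 + dist)


server_model = {"dense": [0.0, 0.0]}
clients = [{"dense": [1.0, 0.0]}, {"dense": [0.0, 3.0]}]
elements = [{"server_model": server_model, "trained_model": c} for c in clients]
zero = {"numerator": {k: [0.0] * len(v) for k, v in server_model.items()},
        "denominator": 0.0}

accumulated = functools.reduce(
    lambda acc, el: reduction(acc, el, distance_weight), elements, zero)
new_model = finalize(accumulated)
```

The client closer to the server model receives the larger weight here (0.5 versus 0.1), so the result is pulled towards it. In TFF, the division would presumably happen in a server-side computation applied to the output of the reduce.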

taking the gradient in Tensorflow, tf.gradient

I am using this TensorFlow function to get the Jacobian of my function. I came across the following problems:
1. The TensorFlow documentation contradicts itself in the following two paragraphs, if I am not mistaken:
gradients() adds ops to the graph to output the partial derivatives of ys with respect to xs. It returns a list of Tensor of length len(xs) where each tensor is the sum(dy/dx) for y in ys.
Returns:
A list of sum(dy/dx) for each x in xs.
According to my test, it in fact returns a vector of len(ys) which is the sum(dy/dx) for each x in xs.
2. I do not understand why they designed it so that the return value is the sum of the columns (or rows, depending on how you define your Jacobian).
3. How can I really get the Jacobian?
4. In the loss, I need the partial derivative of my function with respect to the input (x), but when I am optimizing with respect to the network weights, I define x as a placeholder whose value is fed later, and the weights are variables. In this case, can I still define the symbolic derivative of the function with respect to the input (x) and put it in the loss? (Later, when we optimize with respect to the weights, this will bring in a second-order derivative of the function.)
1. I think you are right and there is a typo there, it was probably meant to be "of length len(ys)".
2. For efficiency. I can't explain exactly the reasoning, but this seems to be a pretty fundamental characteristic of how TensorFlow handles automatic differentiation. See issue #675.
3. There is no straightforward way to get the Jacobian matrix in TensorFlow. Take a look at this answer and again issue #675. Basically, you need one call to tf.gradients per column/row.
4. Yes, of course. You can compute whatever gradients you want, there is no real difference between a placeholder and any other operation really. There are a few operations that do not have a gradient because it is not well defined or not implemented (in which case it will generally return 0), but that's all.
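The difference between the summed gradient and the full Jacobian, and the "one call per row" recipe, can be illustrated in plain Python. This is a minimal sketch in which central finite differences stand in for tf.gradients; f and gradient_of_output are hypothetical names chosen for the example:

```python
def f(x):
    """Vector-valued function R^2 -> R^2."""
    return [x[0] * x[1], x[0] + x[1]]


def gradient_of_output(func, x, row, h=1e-6):
    """Central-difference gradient of one output component with respect to x."""
    grad = []
    for j in range(len(x)):
        up = x[:j] + [x[j] + h] + x[j + 1:]
        down = x[:j] + [x[j] - h] + x[j + 1:]
        grad.append((func(up)[row] - func(down)[row]) / (2 * h))
    return grad


x = [2.0, 3.0]
# One gradient computation per output gives the rows of the Jacobian.
jacobian = [gradient_of_output(f, x, row) for row in range(2)]
# Differentiating all outputs at once collapses the rows into their sum.
summed = [sum(col) for col in zip(*jacobian)]
```

At x = (2, 3) the Jacobian is approximately [[3, 2], [1, 1]], while the summed gradient is approximately [4, 3]. The sum cannot be split back into rows, which is why each output needs its own gradient call.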

Tensorflow: opt.compute_gradients() returns values different from the weight difference of opt.apply_gradients()

Question: What is the most efficient way to get the delta of my weights in a TensorFlow network?
Background: I've got the operators hooked up as follows (thanks to this SO question):
self.cost = `the rest of the network`
self.rmsprop = tf.train.RMSPropOptimizer(lr,rms_decay,0.0,rms_eps)
self.comp_grads = self.rmsprop.compute_gradients(self.cost)
self.grad_placeholder = [(tf.placeholder("float", shape=grad[1].get_shape(), name="grad_placeholder"), grad[1]) for grad in self.comp_grads]
self.apply_grads = self.rmsprop.apply_gradients(self.grad_placeholder)
Now, to feed in information, I run the following:
feed_dict = `training variables`
grad_vals = self.sess.run([grad[0] for grad in self.comp_grads], feed_dict=feed_dict)
feed_dict2 = `feed_dict plus gradient values added to self.grad_placeholder`
self.sess.run(self.apply_grads, feed_dict=feed_dict2)
The command of run(self.apply_grads) will update the network weights, but when I compute the differences in the starting and ending weights (run(self.w1)), those numbers are different than what is stored in grad_vals[0]. I figure this is because the RMSPropOptimizer does more to the raw gradients, but I'm not sure what, or where to find out what it does.
So back to the question: How do I get the delta on my weights in the most efficient way? Am I stuck running self.w1.eval(sess) multiple times to get the weights and calculate the difference? Is there something that I'm missing with the tf.RMSPropOptimizer function?
Thanks!
RMSprop does not simply subtract the gradient from the parameters but uses a more complicated formula involving a combination of:
a momentum term, if the corresponding parameter is not 0
a gradient step, rescaled non-uniformly (on each coordinate) by the square root of a running average of the squared gradient.
For more information you can refer to these slides or this recent paper.
The delta is first computed in memory by tensorflow in the slot variable 'momentum' and then the variable is updated (see the C++ operator).
Thus, you should be able to access it and construct a delta node with delta_w1 = self.rmsprop.get_slot(self.w1, 'momentum'). (I have not tried it yet.)
You can add the weights to the list of things to fetch each run call. Then you can compute the deltas outside of TensorFlow since you will have the iterates. This should be reasonably efficient, although it might incur an extra elementwise difference, but to avoid that you might have to hack around in the guts of the optimizer and find where it puts the update before it applies it and fetch that each step. Fetching the weights each call shouldn't do wasteful extra evaluations of part of the graph at least.
RMSProp does complicated scaling of the learning rate for each weight. Basically it divides the learning rate for a weight by a running average of the magnitudes of recent gradients of that weight.
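A scalar version of the update makes the discrepancy with the raw gradient concrete. This is a sketch assuming the non-centered update rule documented for TensorFlow's ApplyRMSProp op; the starting value ms=1.0 for the running average is also an assumption, and rmsprop_step is a hypothetical helper:

```python
import math


def rmsprop_step(grad, ms, mom, lr, decay=0.9, momentum=0.0, eps=1e-10):
    """One scalar RMSProp update; returns the applied delta and the new slots."""
    ms = decay * ms + (1.0 - decay) * grad * grad  # running mean of squared grads
    mom = momentum * mom + lr * grad / math.sqrt(ms + eps)
    return -mom, ms, mom                           # the weight changes by -mom


grad = 2.0
delta, ms, mom = rmsprop_step(grad, ms=1.0, mom=0.0, lr=0.1)
sgd_delta = -0.1 * grad  # what plain gradient descent would have applied
```

With these numbers the applied change is about -0.175 instead of -0.2, so comparing weight differences against the values returned by compute_gradients cannot match, even after accounting for the learning rate.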