I'll describe what I'm trying to do, and hopefully, somebody can tell me what design pattern this is or point out a better alternative.
I have a method doing a bunch of complicated stuff that involves an approximation. It is possible to compute the result without the approximation but this involves more work.
What I want to do is get out some comparisons related to the inner workings of the implementation. My idea is to pass in an object that would do this extra work and store the info about the comparisons.
I think I want to go from:
class Classifier(object):
    ...
    def classify(self, data):
        # Do some approximate stuff with the data and other
        # instance members.
to
class TruthComparer(object):
    def compute_and_store_stats(self, data, accessory_data):
        # Do the exact computation, compare to the approximation,
        # and store it.

    def get_comparison_stats(self):
        # Return detailed and aggregate information about the
        # differences between the truth and the approximation.

class Classifier(object):
    ...
    def classify(self, data, truth_comparer=None):
        # Do some approximate stuff with the data and other
        # instance members.
        # Optionally, do exact stuff with the data and other
        # instance members, storing info about differences
        # between the exact computation and the approximation
        # in the truth_comparer.
        if truth_comparer is not None:
            truth_comparer.compute_and_store_stats(data,
                                                   [self._index, self._model],
                                                   intermediate_approximation)
The reason I don't want to do those comparisons inline within the classify method is that I don't think it fits the job of that method or object to do those comparisons.
So, what design pattern, if any, is this? Can you suggest an alternative?
You could use the Decorator pattern. You define a Classifier interface and a TruthComparerDecorator that implements it. The decorator takes a Classifier as input, computes the approximation with that classifier instance, and then runs the compute_and_store_stats method. Using this pattern, a classifier does not need to know anything about a TruthComparer. In the end, a TruthComparerDecorator is a Classifier, but it does some more things. In Java this could look like:
public interface Classifier {
    void classify(Data data);
}

public abstract class TruthComparerDecorator implements Classifier {
    private Classifier classifier;

    public TruthComparerDecorator(Classifier classifier) {
        this.classifier = classifier;
    }

    public void classify(Data data) {
        classifier.classify(data);
        computeAndStoreStats(data);
    }

    public abstract void computeAndStoreStats(Data data);
}
My problem with your proposed change is that it doesn't seem right that the classifier requests the compute and store stats part. The classifier shouldn't really be concerned about that. Ideally, it shouldn't even know that the TruthComparer exists.
I'd suggest you really want two methods on Classifier: classify/classify_exact. Classify returns the approximate result; classify_exact returns the exact result. Instead of passing the TruthComparer as a parameter, give TruthComparer the two classifications and let it do its thing.
That way you reduce the number of objects your Classifier has to deal with (lower coupling), and I think it makes what is going on clearer.
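In rough Python terms (the compare method name here is just a placeholder), that might look like:

class Classifier(object):
    def classify(self, data):
        # Approximate computation, as in the question.
        ...

    def classify_exact(self, data):
        # Exact, more expensive computation.
        ...

class TruthComparer(object):
    def compare(self, approximate_result, exact_result):
        # Compute and store stats about the differences; the Classifier
        # never needs to know this object exists.
        ...

# The caller wires the pieces together:
# approx = classifier.classify(data)
# exact = classifier.classify_exact(data)
# comparer.compare(approx, exact)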
I have been trying to implement this paper. Basically, what I want to do is sum the per-client loss and compare it with the previous epoch. Then, for each constituent layer of the model, compare the KL divergence between the weights of the server and the client model to get the layer-specific parameter updates, then do a softmax and decide whether an adaptive update or a normal FedAvg approach is needed.
The algorithm is as follows:
[FedMed algorithm, figure omitted]
I tried to make use of the code here to build a custom federated averaging process. I have the basic understanding that there are some tf.computations and some tff.computations involved. I get that I need to make changes to the orchestration logic in the run_one_round function and basically manipulate the client outputs to do adaptive averaging instead of the vanilla federated averaging. The client_update tf.computation function basically returns all the values that I need, i.e. the weights_delta (which can be used for client-based model weights) and model_output (which can be used to calculate the loss).
But I am not sure where exactly I should make the changes.
@tff.federated_computation(federated_server_state_type,
                           federated_dataset_type)
def run_one_round(server_state, federated_dataset):
    server_message = tff.federated_map(server_message_fn, server_state)
    server_message_at_client = tff.federated_broadcast(server_message)
    client_outputs = tff.federated_map(
        client_update_fn, (federated_dataset, server_message_at_client))
    weight_denom = client_outputs.client_weight
    # TODO: instead of using tff.federated_mean I wish to do an adaptive
    # aggregation based on client_outputs.weights_delta and the server_state model.
    round_model_delta = tff.federated_mean(
        client_outputs.weights_delta, weight=weight_denom)
    # client_outputs.weights_delta has all the client model weights.
    # client_outputs.client_weight has the number of examples per client.
    # client_outputs.model_output has the output of the model per client example.
I want to make use of the server model weights via the server_state object.
I want to calculate the KL divergence between the weights of the server model and each client's model, per layer, and then use a relative weight to aggregate the client weights instead of vanilla federated averaging.
Instead of using tff.federated_mean I wish to use a different strategy, basically an adaptive one based on the algorithm above.
So I need some suggestions on how to go about implementing this.
Basically what I want to do is:
1) Sum all the values of the client losses.
2) Calculate the KL divergence on a per-layer basis between all the clients and the server, and then determine whether to use adaptive optimization or FedAvg.
Also, is there a way to manipulate this value as a Python value, which would be helpful for debugging purposes? (I tried to use tf.print but that was not helpful either.) Thanks!
Simplest option: compute weights for mean on clients
If I read the algorithm above correctly, we need only compute some weights for a mean on-the-fly. tff.federated_mean accepts an optional CLIENTS-placed weight argument, so probably the simplest option here is to compute the desired weights on the clients and pass them in to the mean.
This would look something like (assuming the appropriate definitions of the variables used below, which we will comment on):
@tff.federated_computation(...)
def round_function(...):
    ...
    # We assume there is a tff.Computation training_fn that performs training,
    # and we're calling it here on the correct arguments.
    trained_clients = tff.federated_map(training_fn, clients_placed_arguments)
    # Next we assume there is a variable in scope, server_model,
    # representing the 'current global model'.
    global_model_at_clients = tff.federated_broadcast(server_model)
    # Here we assume a function compute_kl_divergence, which takes
    # two structures of tensors and computes the KL divergence
    # (as a scalar) between them. The two arguments here are clients-placed,
    # so the result will be as well.
    kl_div_at_clients = tff.federated_map(compute_kl_divergence,
                                          (global_model_at_clients, trained_clients))
    # Perhaps we wish to not use raw KL divergence as the weight, but rather
    # some function thereof; if so, we map a postprocessing function to
    # the computed divergences. The result will still be clients-placed.
    mean_weight = tff.federated_map(postprocess_divergence, kl_div_at_clients)
    # Now we simply use the computed weights in the mean.
    return tff.federated_mean(trained_clients, weight=mean_weight)
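For completeness, one possible definition of the assumed compute_kl_divergence is sketched below. It treats each layer's flattened weights as a softmax distribution, which is only a guess at what the paper intends, and model_weights_type is an assumed type alias:

@tff.tf_computation(model_weights_type, model_weights_type)
def compute_kl_divergence(server_weights, client_weights):
    # Per layer: softmax over the flattened weights, then KL(server || client).
    def layer_kl(server_layer, client_layer):
        p = tf.nn.softmax(tf.reshape(server_layer, [-1]))
        q = tf.nn.softmax(tf.reshape(client_layer, [-1]))
        return tf.reduce_sum(p * (tf.math.log(p) - tf.math.log(q)))
    per_layer = tf.nest.map_structure(layer_kl, server_weights, client_weights)
    # Sum the per-layer divergences into a single scalar.
    return tf.add_n(tf.nest.flatten(per_layer))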
More flexible tool: tff.federated_reduce
TFF generally encourages algorithm developers to implement whatever they can 'in the aggregation', and as such exposes some highly customizable primitives like tff.federated_reduce, which allow you to run arbitrary TensorFlow "in the stream" between clients and server. If the above reading of the desired algorithm is incorrect and something more involved is needed, or you wish to flexibly experiment with totally different notions of aggregation (something TFF encourages and is designed to support), this may be the option for you.
In TFF's heuristic typing language, tff.federated_reduce has signature:
<{T}#CLIENTS, U, (<U, T> -> U)> -> U#SERVER
Meaning, federated_reduce takes a value of type T placed at the clients, a 'zero' in a reduction algebra of type U, and a function accepting a U and a T and producing a U, and applies this function 'in the stream' on the way between clients and server, producing a U placed at the server. The function (<U, T> -> U) will be applied to the partially accumulated value U and the 'next' element in the stream T (note, however, that TFF does not guarantee ordering of these values), returning another partially accumulated value U. The 'zero' should represent whatever 'partially accumulated' means over the empty set in your application; this will be the starting point of the reduction.
Application to this problem
The components
Your reduction function needs access to two pieces of data: the global model state and the result of training on a given client. This maps quite nicely to the type T. In this application, we will have something like:
T = <server_model=server_model_type, trained_model=trained_model_type>
These two types are likely to be the same, but may not necessarily be so.
Your reduction function will accept the partial aggregate, your server model and your client-trained model, returning a new partial aggregate. Here we will start assuming the same reading of the algorithm as above, that of a weighted mean with particular weights. Generally, the easiest way to compute a mean is to keep two accumulators, one for numerator and one for denominator. This will affect the choice of zero and reduction function below.
Your zero should contain a structure of tensors with value 0 mapping to the weights of your model--this will be the numerator. This would be generated for you if you had an aggregation like tff.federated_sum (as TFF knows what the zero should be), but for this case you'll have to get your hands on such a tensor yourself. This shouldn't be too hard with tf.nest.map_structure and tf.zeros_like.
For the denominator, we will assume we just need a scalar. TFF and TF are much more flexible than this--you could keep a per-layer or per-parameter denominator if desired--but for simplicity we will assume that we just want to divide by a single float in the end.
Therefore our type U will be something like:
U = <numerator=server_model_type, denominator=tf.float32>
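A rough sketch of building such a zero, assuming model_weights is an in-scope structure of tensors shaped like your model's weights:

import collections
import tensorflow as tf

# 'model_weights' is an assumed structure of tensors matching the model; the
# numerator starts as all-zeros of the same shapes, the denominator as 0.0.
zero = collections.OrderedDict(
    numerator=tf.nest.map_structure(tf.zeros_like, model_weights),
    denominator=tf.constant(0.0))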
Finally we come to our reduction function. It will be more or less a different composition of the same pieces above; we will make slightly tighter assumptions about them here (in particular, that all the local functions are tff.tf_computations--a technical assumption, arguably a bug in TFF). Our reduction function will be along these lines (assuming appropriate type aliases):
@tff.tf_computation(U, T)
def reduction(partial_accumulate, next_element):
    kl_div = compute_kl_divergence(
        next_element.server_model, next_element.trained_model)
    weight = postprocess_divergence(kl_div)
    # Accumulate weight * trained_model into the numerator, leafwise over the
    # structure of model weights.
    new_numerator = tf.nest.map_structure(
        lambda acc, w: acc + weight * w,
        partial_accumulate.numerator, next_element.trained_model)
    new_denominator = partial_accumulate.denominator + weight
    return collections.OrderedDict(
        numerator=new_numerator, denominator=new_denominator)
Putting them together
The basic outline of a round will be similar to the above, but we have put more computation 'in the stream', and consequently there will be less on the clients. We assume here the same variable definitions.
@tff.federated_computation(...)
def round_function(...):
    ...
    trained_clients = tff.federated_map(training_fn, clients_placed_arguments)
    global_model_at_clients = tff.federated_broadcast(server_model)
    # This zip I believe is not necessary, but it helps my mental model.
    reduction_arg = tff.federated_zip(
        collections.OrderedDict(server_model=global_model_at_clients,
                                trained_model=trained_clients))
    # We assume a zero as specified above.
    return tff.federated_reduce(reduction_arg,
                                zero,
                                reduction)
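The reduce leaves the numerator/denominator pair placed at the server; one way to finish the weighted mean (a sketch, not tied to a particular TFF version) is to divide at the server, for example:

@tff.tf_computation(U)
def finalize(accumulated):
    # Divide each numerator tensor by the scalar denominator to recover the
    # weighted mean of the trained client models.
    return tf.nest.map_structure(
        lambda t: t / accumulated.denominator, accumulated.numerator)

# Inside round_function, after the reduce:
# reduced = tff.federated_reduce(reduction_arg, zero, reduction)
# return tff.federated_map(finalize, reduced)  # tff.federated_apply in older TFF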
Let's say my model has two classes, Class 1 and Class 2. Both Class 1 and Class 2 have an equal amount of training and testing data. But I want to penalize the loss of Class 1 more than Class 2, so that one class has fewer false positives than the other (I want the model to perform better for one class than the other).
How do I achieve this in Tensorflow?
The thing you are looking for is probably tf.nn.weighted_cross_entropy_with_logits.
It gives very closely related contextual information, similar to @Sazzad's answer, but is specific to TensorFlow. To quote the documentation:
This is like sigmoid_cross_entropy_with_logits() except that
pos_weight, allows one to trade off recall and precision by up- or
down-weighting the cost of a positive error relative to a negative
error.
It accepts an additional argument, pos_weight. Also note that this is only for binary classification, which is the case in the example you described. If there might be other classes besides the two, this would not work.
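For example, with the TF 2.x signature (the tensor values below are made up):

import tensorflow as tf

# Hypothetical binary labels in {0, 1} and raw model logits.
labels = tf.constant([[1.0], [0.0], [1.0]])
logits = tf.constant([[0.5], [-1.2], [2.0]])

# pos_weight > 1 makes errors on the positive class cost more,
# trading precision for recall on that class.
loss = tf.nn.weighted_cross_entropy_with_logits(
    labels=labels, logits=logits, pos_weight=5.0)
mean_loss = tf.reduce_mean(loss)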
If I understand your question correctly, this is not a TensorFlow-specific concept; you can write your own loss. For binary classification, with p the predicted probability of class 1, the cross-entropy loss is something like this:
loss = -[y*log(p) + (1-y)*log(1-p)]
Here class 0 and class 1 have the same weight in the loss, so you can give more weight to one of the terms. For example:
loss = -[5*y*log(p) + (1-y)*log(1-p)]
Hope it answers your question.
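For concreteness, a hand-rolled version of that weighted loss might look like the sketch below (the weight of 5 and the function name are arbitrary):

import tensorflow as tf

def weighted_binary_cross_entropy(y_true, y_pred, class1_weight=5.0, eps=1e-7):
    # Standard binary cross-entropy, except the positive-class term is scaled
    # by class1_weight so mistakes on class 1 are penalized more.
    y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
    loss = -(class1_weight * y_true * tf.math.log(y_pred)
             + (1.0 - y_true) * tf.math.log(1.0 - y_pred))
    return tf.reduce_mean(loss)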
I am trying to implement a custom op in TensorFlow that represents a computationally heavy transfer function computed in C++ using Eigen on GPU. I would like to accelerate the computation of the gradient (also in C++ for speed) of the op by re-using some of the intermediate values obtained while computing its output.
In the source code of tensorflow/core/kernels/cwise_ops_gradients.h we see that many functions already do that to some extent by re-using the output of the op to compute its derivative. Here is the example of the sigmoid:
template <typename T>
struct scalar_sigmoid_gradient_op {
  EIGEN_EMPTY_STRUCT_CTOR(scalar_sigmoid_gradient_op)
  EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE const T
  operator()(const T& output, const T& output_gradient) const {
    return output_gradient * output * (T(1) - output);
  }
};
However, I don't see how I can access anything other than the output itself, for example some other values I stored during the forward pass, to accelerate the computation of my derivative.
I've thought about adding a second output to my op, with all the data required for the derivative, and use it for the computation of the gradient of the actual output, but I've not managed to make it work yet. I'm not sure if it could work in principle.
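In Python terms, the wiring I have in mind would be roughly the following (the op and helper names are hypothetical and I have not verified this works):

import tensorflow as tf

# Assumed: the custom op library exposes the forward op plus a companion
# gradient op; the library path and op names are placeholders.
_module = tf.load_op_library("./my_transfer_function.so")

@tf.RegisterGradient("MyTransferFunction")
def _my_transfer_function_grad(op, grad_output, grad_saved):
    # op.outputs[1] would be the second output carrying the intermediates saved
    # during the forward pass; grad_saved is ignored since that output is
    # auxiliary and not meant to be differentiated.
    saved = op.outputs[1]
    return _module.my_transfer_function_grad(op.inputs[0], saved, grad_output)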
Another approach I imagined is to manually modify the full graph (forward and backprop) to shortcut an output from the op directly towards its derivative block. I'm not sure how to do it.
Otherwise, there may be a data storage scheme I'm not aware of and that would allow me to store data in the forward pass of an op and retrieve it during gradient computation.
Thank you for your attention, I would greatly appreciate any ideas.
D
I have one predictor which is dominating my models. I still want to include it, but I want to down-weight its importance in the final model. Is there a good (sci)pythonic way to do this? I'm thinking of maybe defining a custom PenaltyTransformer which introduces random noise into the variable, something like this:
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class PenaltyTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, columns, scale=0.1):
        self.scale = scale
        self.columns = columns
    def fit(self, X, y=None):
        # Nothing to fit; needed so the transformer works in a Pipeline.
        return self
    def transform(self, X):
        X[:, self.columns] += np.random.normal(
            loc=0, scale=self.scale, size=X[:, self.columns].shape)
        return X
... does this make sense?
Without knowing anything about your application, it's hard to give a definitive answer on what you should do. A few options I can see:
Your noising approach in the question might be fine (a usage sketch follows after this list)
You might use a model with higher bias and lower variance like a regularized linear model (I'm assuming you're doing GBM, RF, or something similar from the use of the term "importance")
You might exclude the highly predictive feature entirely
You might build a model that excludes the highly predictive feature entirely and then combines the resulting score with that feature in some way
You might also just want to accept that the strong feature is going to be the dominant factor in your model
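If you keep the noising approach, a minimal usage sketch might be the following (the GBM model is just an arbitrary placeholder matching my assumption above):

from sklearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingClassifier

# Noise column 0 before fitting, which should dilute that feature's importance.
pipe = Pipeline([
    ("penalize", PenaltyTransformer(columns=[0], scale=0.5)),
    ("model", GradientBoostingClassifier()),
])
# pipe.fit(X_train, y_train)

Note that the transformer's noise is also applied at prediction time, which may or may not be what you want.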
Is there a way in TensorFlow to find out if two graphs have the same structure ?
I am designing an abstract class whose individual instances are expected to represent different architectures. I have provided an abc.abstractmethod get() which defines the graph. However, I also want to be able to load a pre-trained graph from disk. I want to check if the pre-trained graph has the same definition as the one mentioned in the get() method of a concrete class.
How may I achieve this structural comparison ?
You can get the graph definition of the current graph as str(tf.get_default_graph().as_graph_def()) and compare it for exact equality against your previous result.
Also, TensorFlow tests have a more advanced function, EqualGraphDef, which can tell that two graphs are equal even when the graph format has changed; i.e., if actual and expected are GraphDef proto objects, you could do:
from tensorflow.python import pywrap_tensorflow

diff = pywrap_tensorflow.EqualGraphDefWrapper(actual.SerializeToString(),
                                              expected.SerializeToString())
assert not diff