Local Model performance in Tensorflow Federated - tensorflow

I am implementing federated learning with tensorflow-federated. The tutorials and all the other material I have found compare the accuracy of the federated (global) model after each communication round. Is there a way to compute the accuracy of each local model so that I can compare it against the federated (global) model?
Total number of clients: 15
For each communication round: Local vs Federated Model performance

I don't know how to achieve this with tff.learning.build_federated_averaging_process, but I recommend taking a look at this simple fedavg implementation. There you can use test_data (the same evaluation dataset you use for the server model) for each client. I would suggest building client_test_datasets = [test_data for x in sampled_train_ids] and passing it as iterative_process.next(server_state, sampled_train_data, client_test_datasets). For that you need to change the signatures of run_one_round and client_update_fn in simple_fedavg_tff.py. In each case the type signature for the test datasets should be the same as the one for the training datasets. Don't forget to actually pass the appropriate test datasets as input to each function.
Then move on to simple_fedavg_tf.py and change client_update. There you basically need to write an evaluation very similar to the one done for the server model. After that, either print the evaluation results, or change the outputs at each level (tf.function, tff.tf_computation, and tff.federated_computation) and return the evaluation results as an output. If you go this way, don't forget to update how you handle the output of iterative_process.next.
Edit: I assumed you wanted the accuracy of the clients when the test dataset is the same as the server test dataset.
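Independently of TFF, the comparison itself can be illustrated with a minimal plain-Python sketch: every client model and the averaged global model are scored on the same test set. The linear threshold classifier, the helper names (predict, accuracy, average_models), and the toy data are all hypothetical stand-ins and are not part of simple_fedavg.

```python
def predict(weights, features):
    """Hypothetical linear classifier: label 1 if the dot product is positive."""
    score = sum(w * x for w, x in zip(weights, features))
    return 1 if score > 0 else 0


def accuracy(weights, test_data):
    """Fraction of (features, label) pairs classified correctly."""
    correct = sum(predict(weights, x) == y for x, y in test_data)
    return correct / len(test_data)


def average_models(client_weights):
    """Unweighted average of client weight vectors (a stand-in for FedAvg)."""
    n = len(client_weights)
    return [sum(ws) / n for ws in zip(*client_weights)]


# The same test set is reused for every client and for the global model.
test_data = [([1.0, 0.0], 1), ([0.0, 1.0], 0), ([1.0, 1.0], 1), ([-1.0, 0.5], 0)]
client_weights = [[1.0, -0.5], [0.5, 0.5], [2.0, -1.5]]

local_accuracies = [accuracy(w, test_data) for w in client_weights]
global_accuracy = accuracy(average_models(client_weights), test_data)

for i, acc in enumerate(local_accuracies):
    print(f"client {i}: local accuracy = {acc:.2f}")
print(f"global accuracy = {global_accuracy:.2f}")
```

In a real run the per-client numbers would come from the evaluation added to client_update, collected once per communication round.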


Customized aggregation algorithm for gradient updates in tensorflow federated

I have been trying to implement this paper. Basically, I want to sum the per-client losses and compare the sum with that of the previous epoch. Then, for each layer of the model, I want to compare the KL divergence between the weights of the server model and the client model to get layer-specific parameter updates, apply a softmax, and decide whether an adaptive update or the normal FedAvg approach is needed.
The algorithm is as follows-
I tried to make use of the code here to build a custom federated averaging process. My basic understanding is that some tf.computations and some tff.computations are involved. I get that I need to change the orchestration logic in the run_one_round function and manipulate the client outputs to do adaptive averaging instead of vanilla federated averaging. The client_update tf.computation function returns all the values that I need, i.e. the weights_delta (which can be used for the client model weights) and the model_output (which can be used to calculate the loss).
But I am not sure where exactly I should make the changes.
def run_one_round(server_state, federated_dataset):
  server_message = tff.federated_map(server_message_fn, server_state)
  server_message_at_client = tff.federated_broadcast(server_message)

  client_outputs = tff.federated_map(
      client_update_fn, (federated_dataset, server_message_at_client))

  weight_denom = client_outputs.client_weight
  # todo
  # instead of using tff.federated_mean I wish to do an adaptive aggregation
  # based on the client_outputs.weights_delta and the server_state model
  round_model_delta = tff.federated_mean(
      client_outputs.weights_delta, weight=weight_denom)
  # client_outputs.weights_delta has all the client model weights.
  # client_outputs.client_weight has the number of examples per client.
  # client_outputs.model_output has the output of the model per client example.
I want to make use of the server model weights through the server_state object.
I want to calculate the KL divergence between the weights of the server model and each client's model, per layer, and then use a relative weight to aggregate the client weights instead of vanilla federated averaging.
Instead of using tff.federated_mean I wish to use a different, adaptive strategy based on the algorithm above.
So I need some suggestions on how to go about implementing this.
Basically what I want to do is:
1) Sum all the values of the client losses.
2) Calculate the KL divergence, on a per-layer basis, of all the clients with respect to the server, and then determine whether to use adaptive optimization or FedAvg.
Also, is there a way to manipulate this value as a Python value, which would be helpful for debugging (I tried to use tf.print, but that was not helpful either)? Thanks!
Simplest option: compute weights for mean on clients
If I read the algorithm above correctly, we need only compute some weights for a mean on-the-fly. tff.federated_mean accepts an optional CLIENTS-placed weight argument, so probably the simplest option here is to compute the desired weights on the clients and pass them in to the mean.
This would look something like (assuming the appropriate definitions of the variables used below, which we will comment on):
def round_function(...):
  # We assume there is a tff.Computation training_fn that performs training,
  # and we're calling it here on the correct arguments
  trained_clients = tff.federated_map(training_fn, clients_placed_arguments)
  # Next we assume there is a variable in-scope server_model,
  # representing the 'current global model'.
  global_model_at_clients = tff.federated_broadcast(server_model)
  # Here we assume a function compute_kl_divergence, which takes
  # two structures of tensors and computes the KL divergence
  # (as a scalar) between them. The two arguments here are clients-placed,
  # so the result will be as well.
  kl_div_at_clients = tff.federated_map(compute_kl_divergence,
                                        (global_model_at_clients, trained_clients))
  # Perhaps we wish to not use raw KL divergence as the weight, but rather
  # some function thereof; if so, we map a postprocessing function to
  # the computed divergences. The result will still be clients-placed.
  mean_weight = tff.federated_map(postprocess_divergence, kl_div_at_clients)
  # Now we simply use the computed weights in the mean.
  return tff.federated_mean(trained_clients, weight=mean_weight)
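To make the arithmetic behind this option concrete, here is a minimal plain-Python sketch of a mean whose per-client weights are derived from a KL divergence. It does not use TFF. Turning a layer's weights into a distribution with a softmax, and mapping the divergence to a weight with exp(-kl), are assumptions made purely for illustration; the paper's algorithm may define both differently.

```python
import math


def softmax(values):
    """Turn a flat list of layer weights into a probability distribution."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def postprocess_divergence(kl):
    """Hypothetical choice: clients closer to the server get a larger weight."""
    return math.exp(-kl)


def weighted_mean(client_values, weights):
    """Per coordinate: sum(w_i * x_i) / sum(w_i)."""
    total = sum(weights)
    return [sum(w * x[j] for w, x in zip(weights, client_values)) / total
            for j in range(len(client_values[0]))]


server_layer = [0.0, 1.0, 2.0]
client_layers = [[0.0, 1.0, 2.0], [2.0, 1.0, 0.0]]

weights = [
    postprocess_divergence(
        kl_divergence(softmax(server_layer), softmax(layer)))
    for layer in client_layers
]
aggregated = weighted_mean(client_layers, weights)
print(weights, aggregated)
```

The first client matches the server exactly, so its divergence is zero and it receives the largest weight; the aggregate is pulled toward it.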
More flexible tool: tff.federated_reduce
TFF generally encourages algorithm developers to implement whatever they can 'in the aggregation', and as such exposes some highly customizable primitives like tff.federated_reduce, which allow you to run arbitrary TensorFlow "in the stream" between clients and server. If the above reading of the desired algorithm is incorrect and something more involved is needed, or you wish to flexibly experiment with totally different notions of aggregation (something TFF encourages and is designed to support), this may be the option for you.
In TFF's heuristic typing language, tff.federated_reduce has signature:
<{T}#CLIENTS, U, (<U, T> -> U)> -> U#SERVER
Meaning, federated_reduce takes a value of type T placed at the clients, a 'zero' in a reduction algebra of type U, and a function accepting a U and a T and producing a U, and applies this function 'in the stream' on the way between clients and server, producing a U placed at the server. The function (<U, T> -> U) will be applied to the partially accumulated value U and the 'next' element in the stream T (note however that TFF does not guarantee ordering of these values), returning another partially accumulated value U. The 'zero' should represent whatever 'partially accumulated' means over the empty set in your application; this will be the starting point of the reduction.
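As a rough mental model only, the reduction behaves like an ordinary fold. The sketch below is plain Python with no placements or TFF types, and federated_reduce_sketch is a hypothetical stand-in, not a TFF function. It also checks the practical consequence of the ordering caveat: the reduction function should produce the same result whatever order the client values arrive in.

```python
import functools
import itertools


def federated_reduce_sketch(client_values, zero, op):
    """Fold `op` over the client values, starting from `zero`."""
    return functools.reduce(op, client_values, zero)


def add(accumulator, value):
    """Reduction function of type (<U, T> -> U); here U and T are both int."""
    return accumulator + value


client_values = [3, 1, 4, 1, 5]

# Run the fold over every possible arrival order of the client values.
results = {
    federated_reduce_sketch(list(order), 0, add)
    for order in itertools.permutations(client_values)
}
print(results)  # {14}

# With no clients at all, the result is just the zero.
print(federated_reduce_sketch([], 0, add))  # 0
```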
Application to this problem
The components
Your reduction function needs access to two pieces of data: the global model state and the result of training on a given client. This maps quite nicely to the type T. In this application, we will have something like:
T = <server_model=server_model_type, trained_model=trained_model_type>
These two types are likely to be the same, but may not necessarily be so.
Your reduction function will accept the partial aggregate, your server model and your client-trained model, returning a new partial aggregate. Here we will start assuming the same reading of the algorithm as above, that of a weighted mean with particular weights. Generally, the easiest way to compute a mean is to keep two accumulators, one for numerator and one for denominator. This will affect the choice of zero and reduction function below.
Your zero should contain a structure of tensors with value 0 mapping to the weights of your model--this will be the numerator. This would be generated for you if you had an aggregation like tff.federated_sum (as TFF knows what the zero should be), but for this case you'll have to get your hands on such a tensor yourself. This shouldn't be too hard with tf.nest.map_structure and tf.zeros_like.
For the denominator, we will assume we just need a scalar. TFF and TF are much more flexible than this--you could keep a per-layer or per-parameter denominator if desired--but for simplicity we will assume that we just want to divide by a single float in the end.
Therefore our type U will be something like:
U = <numerator=server_model_type, denominator=tf.float32>
Finally we come to our reduction function. It will be more or less a different composition of the same pieces above; we will make slightly tighter assumptions about them here (in particular, that all the local functions are tff.tf_computations--a technical assumption, arguably a bug on TFF). Our reduction function will be along the lines (assuming appropriate type aliases):
@tff.tf_computation(U, T)
def reduction(partial_accumulate, next_element):
  kl_div = compute_kl_divergence(
      next_element.server_model, next_element.trained_model)
  weight = postprocess_divergence(kl_div)
  # Scale and accumulate leaf by leaf; the model is a structure of tensors,
  # so this assumes numerator and trained_model share the same structure.
  new_numerator = tf.nest.map_structure(
      lambda acc, w: acc + weight * w,
      partial_accumulate.numerator, next_element.trained_model)
  new_denominator = partial_accumulate.denominator + weight
  return collections.OrderedDict(
      numerator=new_numerator, denominator=new_denominator)
Putting them together
The basic outline of a round will be similar to the above, but we have put more computation 'in the stream', and consequently there will be less on the clients. We assume here the same variable definitions.
def round_function(...):
  trained_clients = tff.federated_map(training_fn, clients_placed_arguments)
  global_model_at_clients = tff.federated_broadcast(server_model)
  # This zip I believe is not necessary, but it helps my mental model.
  reduction_arg = tff.federated_zip(
      collections.OrderedDict(server_model=global_model_at_clients,
                              trained_model=trained_clients))
  # We assume a zero as specified above
  return tff.federated_reduce(reduction_arg, zero, reduction)
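The reduction leaves a numerator and a denominator at the server, so one last division is still needed to obtain the new model. A minimal plain-Python sketch of the whole accumulate-then-divide pattern follows. It does not use TFF; the dictionary layout, make_zero, and finalize are hypothetical names, and each client's weight is given directly here rather than derived from a KL divergence.

```python
import functools


def make_zero(model):
    """Zero accumulator: zeros shaped like the model plus a scalar denominator."""
    return {"numerator": [0.0 for _ in model], "denominator": 0.0}


def reduction(partial, element):
    """Add one client's weighted model to the running numerator/denominator."""
    weight = element["weight"]
    return {
        "numerator": [acc + weight * w
                      for acc, w in zip(partial["numerator"],
                                        element["trained_model"])],
        "denominator": partial["denominator"] + weight,
    }


def finalize(accumulated):
    """Server-side step: divide the numerator by the denominator."""
    return [n / accumulated["denominator"] for n in accumulated["numerator"]]


elements = [
    {"trained_model": [1.0, 2.0], "weight": 1.0},
    {"trained_model": [3.0, 6.0], "weight": 3.0},
]
zero = make_zero(elements[0]["trained_model"])
total = functools.reduce(reduction, elements, zero)
new_model = finalize(total)
print(new_model)  # [2.5, 5.0]
```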

Learning parameters of each simulated device

Does tensorflow-federated support assigning different hyper-parameters(like batch-size or learning rate) for different simulated devices?
Currently, you may find this a bit unnatural, but yes, such a thing is possible.
One approach that is supported today is to have each client take its local learning rate as a top-level parameter and use it in training. A dummy example (eliding the model parameter in the computations below) would be something along the lines of
@tff.tf_computation(tff.SequenceType(...), tf.float32)
def train_with_learning_rate(ds, lr):
  # run training with `tf.data.Dataset` ds and learning rate lr
  ...

@tff.federated_computation(
    tff.FederatedType([tff.SequenceType(...), tf.float32], tff.CLIENTS))
def run_one_round(datasets_and_lrs):
  return tff.federated_mean(
      tff.federated_map(train_with_learning_rate, datasets_and_lrs))
Invoking the federated computation with a list of tuples, where the first element of each tuple is the client's data and the second element is that client's learning rate, would give what you want.
Such a thing requires writing custom federated computations, and in particular likely defining your own IterativeProcess. A similar iterative process definition was recently open sourced here; the link goes to the relevant local client function definition, which allows for learning rate scheduling on the clients by taking an extra integer parameter representing the round number. It is likely a good place to look.
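The data flow of that approach can be sketched without TFF: each client receives a (dataset, learning rate) pair, trains locally, and the results are averaged. The one-parameter least-squares model and the toy data below are assumptions chosen only to keep the example short; the function names mirror the ones above but are plain Python functions here.

```python
def train_with_learning_rate(dataset, learning_rate, initial_weight=0.0):
    """One pass of SGD for a one-parameter model y = w * x with squared error."""
    w = initial_weight
    for x, y in dataset:
        gradient = 2.0 * (w * x - y) * x
        w -= learning_rate * gradient
    return w


def run_one_round(datasets_and_lrs):
    """Train every client with its own learning rate, then average the results."""
    trained = [train_with_learning_rate(ds, lr) for ds, lr in datasets_and_lrs]
    return sum(trained) / len(trained)


datasets_and_lrs = [
    ([(1.0, 2.0), (2.0, 4.0)], 0.1),   # client 0, larger learning rate
    ([(1.0, 2.0), (2.0, 4.0)], 0.01),  # client 1, smaller learning rate
]
round_result = run_one_round(datasets_and_lrs)
print(round_result)
```

Both clients hold the same data, so any difference between their locally trained weights comes from the learning rate alone.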

How to test a machine learning model?

I want to develop a framework (for QA testing purposes) that validates a machine learning model. I have had a lot of discussions with my peers and read articles found through Google.
Most of the discussions and articles say that a machine learning model will evolve with the test data that we provide. Correct me if I'm wrong.
Is it possible to develop a framework that validates that a machine learning model gives accurate results?
A few ways to test the model, from the articles I read: the split and multi-split techniques, and metamorphic testing.
Please also suggest any other approaches.
QA testing of ML-based software requires additional, and rather unconventional, tests because oftentimes their outputs for a given set of inputs are not defined, deterministic, or known a priori and they produce approximations rather than exact results.
QA may be designed to test against:
naive but predictable benchmark methods: the average method in forecasting, the class-frequency-based classifier in classification, etc.
sanity checks (the outputs being feasible/rational): e.g., is the predicted age positive?
preset objective acceptance levels: e.g., is its AUCROC > 0.5?
extreme/boundary cases: e.g., thunderstorm conditions for a weather forecast model.
bias-variance tradeoff: what is its performance on in-sample and out-of-sample data? K-Fold cross-validation is useful here.
the model itself: is the coefficient of variation of its performance measure (e.g., AUCROC) from n runs on the same data for same/random train and test partitioning within a reasonable bound?
Some of these tests need performance measures. Here is a comprehensive library of them.
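Several of the checks listed above can be written as plain assertions. The sketch below uses only the Python standard library; the data, the 0.8 acceptance level, and the 0.05 bound on the coefficient of variation are hypothetical values chosen for illustration, not recommended thresholds.

```python
import statistics


def majority_class_accuracy(labels):
    """Accuracy of the naive classifier that always predicts the commonest label."""
    most_common = max(set(labels), key=labels.count)
    return labels.count(most_common) / len(labels)


def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def coefficient_of_variation(scores):
    """Sample standard deviation relative to the mean across repeated runs."""
    return statistics.stdev(scores) / statistics.mean(scores)


labels = [1, 1, 1, 0, 0, 1, 0, 1]
predictions = [1, 1, 1, 0, 0, 1, 1, 1]             # hypothetical model output
predicted_ages = [34.0, 51.5, 8.0]                 # hypothetical regression output
scores_over_runs = [0.86, 0.88, 0.87, 0.85, 0.89]  # hypothetical metric, 5 runs

model_accuracy = accuracy(predictions, labels)
checks = {
    "beats_naive_baseline": model_accuracy > majority_class_accuracy(labels),
    "ages_are_positive": all(age > 0 for age in predicted_ages),
    "meets_acceptance_level": model_accuracy >= 0.8,
    "stable_across_runs": coefficient_of_variation(scores_over_runs) < 0.05,
}
print(checks)
```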
I think the data flow is actually what needs to be tested here: raw input, manipulation, test output, and predictions. For example, if you have a simple linear model, you want to test the predictions produced by that model rather than its coefficients. The high-level steps can be summarized as follows:
Raw Input: Does the raw input make sense? Before you start manipulating, you need to be sure the raw data values are within the expected limits. For example, if you normally see 5-10% NA rate in some data, having 95% NA rate in a new batch might be an indicator that something is wrong.
Train/Predict Ready Input: Whether you are training a new model or feeding new data into an already trained model for prediction, you probably want to be sure that the manipulated data makes sense, too. Some ML algorithms are sensitive to data anomalies. You don't want to predict a credit score in the thousands just because you have some anomalies in the input.
Model Success: By this time, you should have some idea of how well your model performs, so you can measure its performance on new test data. You can also check that the train and test scores are not significantly different (a large gap suggests overfitting). If you're retraining, you can compare with the previous training scores. Or you can hold out a separate test set and compare its score.
Predictions: Finally, you need to be sure your final output makes sense before delivering to production/clients. For example, if you're revenue forecasting for a very small shop, the daily revenue predictions can't be million dollars or some negative amounts.
Full disclosure, I wrote a small Python package for this. You can check here or download as below,
pip install mlqa
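Independently of any package, the raw-input and prediction checks described above can be sketched in a few lines of plain Python. This does not use the mlqa API; the helper names, the 10% limit on missing values, and the revenue bounds are hypothetical.

```python
def na_rate(values):
    """Fraction of missing (None) entries in a column."""
    return sum(v is None for v in values) / len(values)


def check_na_rate(values, max_rate):
    """True when the share of missing values stays within the expected limit."""
    return na_rate(values) <= max_rate


def check_range(values, lower, upper):
    """True when every non-missing value lies inside [lower, upper]."""
    return all(lower <= v <= upper for v in values if v is not None)


usual_batch = [5.0, None, 7.5, 6.1, 4.9, 5.5, 6.6, 7.0, 5.2, 6.3]  # 10% missing
broken_batch = [None] * 19 + [5.0]                                 # 95% missing
daily_revenue_forecast = [820.0, 910.5, -15.0]  # contains a negative amount

print(check_na_rate(usual_batch, max_rate=0.10))           # True
print(check_na_rate(broken_batch, max_rate=0.10))          # False
print(check_range(daily_revenue_forecast, 0.0, 10_000.0))  # False
```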

In distributed tensorflow, how to write to summary from workers as well

I am using the Google Cloud ML distributed sample to train a model on a cluster of computers. Input and output (i.e. tfrecords, checkpoints, tfevents) are all on gs:// (Google Cloud Storage).
As in the distributed sample, I use an evaluation step that is called at the end, and the result is written as a summary so that it can be used for hyperparameter tuning, either within Cloud ML or with my own stack of tools.
But rather than performing a single evaluation on a large batch of data, I run several evaluation steps in order to retrieve statistics on the performance criteria, because I don't want to be limited to a single value. I want information about the performance interval. In particular, the variance of the performance is important to me: I'd rather select a model with lower average performance but better worst cases.
I therefore run several evaluation steps. What I would like to do is parallelize these evaluation steps, because right now only the master is evaluating. On large clusters this is a source of inefficiency, and I would like the workers to evaluate as well.
Basically, the supervisor is created as :
self.sv = tf.train.Supervisor(
    # ... other arguments omitted ...
    # Write summary_ops by hand.
    summary_op=None,
    # No saving; we do it manually in order to easily evaluate immediately
    # afterwards.
    save_model_secs=0)
At the end of training I call the summary writer:
# only on master, this is what I want to remove
if self.is_master and not self.should_stop:
  # I want to have an idea of statistics of accuracy
  # not just the mean, hence I run on 10 batches
  for i in range(10):
    self.global_step += 1
    # I call an evaluator, and extract the accuracy
    evaluation_values = self.evaluator.evaluate()
    accuracy_value = self.model.accuracy_value(evaluation_values)
    # now I dump the accuracy, ready to use within hptune
    eval_summary = tf.Summary(value=[
        tf.Summary.Value(
            tag='training/hptuning/metric', simple_value=accuracy_value)
    ])
    self.sv.summary_computed(session, eval_summary, self.global_step)
I tried to write summaries from the workers as well, but I got an error: basically, summaries can be written from the master only. Is there an easy workaround? The error is: "Writing a summary requires a summary writer."
My guess is that you would create a separate summary writer on each worker yourself and write out the summaries directly.
I suspect you wouldn't use a supervisor for the eval processing either. Just load a session on each worker to run eval with the latest checkpoint, and write out independent summaries.
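The pattern of independent per-worker outputs that are combined afterwards can be sketched with the standard library alone. This is only a stand-in for the idea: it writes small JSON files to a local temporary directory instead of TensorFlow event files on gs://, and the file naming, helper names, and accuracy values are hypothetical.

```python
import json
import statistics
import tempfile
from pathlib import Path


def write_worker_results(output_dir, worker_index, accuracies):
    """Each worker writes its own file, so no shared writer is needed."""
    path = Path(output_dir) / f"eval-worker-{worker_index}.json"
    path.write_text(json.dumps({"worker": worker_index,
                                "accuracies": accuracies}))
    return path


def aggregate_results(output_dir):
    """Collect every worker's accuracies and summarize their distribution."""
    accuracies = []
    for path in sorted(Path(output_dir).glob("eval-worker-*.json")):
        accuracies.extend(json.loads(path.read_text())["accuracies"])
    return {
        "mean": statistics.mean(accuracies),
        "stdev": statistics.stdev(accuracies),
        "worst": min(accuracies),
        "count": len(accuracies),
    }


with tempfile.TemporaryDirectory() as output_dir:
    # Hypothetical accuracies from two workers evaluating different batches.
    write_worker_results(output_dir, 0, [0.91, 0.89, 0.90])
    write_worker_results(output_dir, 1, [0.88, 0.92])
    summary = aggregate_results(output_dir)

print(summary)
```

Keeping the worst case and the standard deviation next to the mean supports the selection criterion from the question: preferring a model with a better worst case over one with a slightly higher average.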

How should I test on a small dataset?

I use Weka to test machine learning algorithms on my dataset. I have 3800 rows and around 25 features. I am testing combinations of different features for prediction models, and with cross-validation they seem to predict worse than the OneR algorithm alone. Even C4.5 does not consistently predict better; sometimes it does and sometimes it does not, depending on the features that are still available for classification.
However, at one point I split my dataset into a test set and a training set (20/80), and on the test set the C4.5 algorithm had a far higher accuracy than OneR. I thought that, given the small size of the dataset, it was probably just a coincidence that it predicted so well (the target classes were still split in roughly the same proportions), and that it is therefore more useful to use cross-validation on small datasets like this one.
However, testing on another test set also gave a high accuracy with C4.5. So my question is: what is the best way to test when the dataset is actually pretty small?
I saw some posts where this is discussed, but I am still not sure what the right way to do it is.
It's almost always a good approach to test your model via Cross-Validation.
A rule of thumb is to use 10 fold cross validation.
In your case, 10 fold cross validation will do the following in Weka:
split your 3800 training instances into 10 sets of 380 instances
for each set (s = 1 .. 10) :
use the instances from s for testing and the other 9 sets for training a model (3420 training instances)
the result will be an average of the results obtained with the 10 models used.
Try to avoid testing your model with the "use training set" option, because that can produce a model that works very well on your existing data but has big problems with new instances (overfitting).
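The fold sizes described above can be checked with a short plain-Python sketch. It only produces index splits and is not Weka's implementation; it assumes the number of instances is divisible by the number of folds and does no stratification by class. The function name k_fold_indices is hypothetical.

```python
import random


def k_fold_indices(n_instances, k, seed=0):
    """Shuffle the instance indices and split them into k equal-sized folds."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    fold_size = n_instances // k
    return [indices[i * fold_size:(i + 1) * fold_size] for i in range(k)]


folds = k_fold_indices(3800, 10)
split_sizes = []
for s, test_fold in enumerate(folds, start=1):
    # Every fold except fold s is used for training.
    train = [i for j, fold in enumerate(folds, start=1) if j != s for i in fold]
    # A model would be trained on `train` and scored on `test_fold` here;
    # the reported result is the average of the 10 scores.
    split_sizes.append((len(train), len(test_fold)))

print(split_sizes[0])  # (3420, 380)
```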