Federated learning in TensorFlow Federated: is there any way to apply early stopping on the client side?

I am using TensorFlow Federated to train a text classification model with the federated learning approach.
Is there any way to apply Early Stopping on the client-side? Is there an option for cross-validation in the API?
The only thing I was able to find is the evaluation:
evaluation = tff.learning.build_federated_evaluation(model_fn)
which is applied to the model at the end of a federated training round.
Am I missing something?

One straightforward way to control the number of steps a client takes when using tff.learning.build_federated_averaging_process is to set up each client's tf.data.Dataset with different parameters, for example limiting the number of steps with tf.data.Dataset.take. The guide tf.data: Build TensorFlow input pipelines has many more details.
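As a minimal sketch of that idea (the per-client step counts and the toy dataset here are invented for illustration):

```python
import tensorflow as tf

# Toy per-client data: 100 examples, batched into 10 batches of 10.
client_dataset = tf.data.Dataset.range(100).batch(10)

# Limit each client to a different number of local steps per round,
# e.g. based on how much data or compute budget that client has.
steps_per_client = [3, 5, 10]
client_datasets = [client_dataset.take(n) for n in steps_per_client]

for n, ds in zip(steps_per_client, client_datasets):
    num_batches = sum(1 for _ in ds)
    print(n, num_batches)  # each client iterates over at most n batches
```

Passing such per-client datasets into the iterative process's `next` call then bounds local work without touching the algorithm itself.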
Alternatively, stopping based on a measurement of learning progress currently requires modifying some internals of the algorithm. Rather than using the APIs in tff.learning, it may be simpler to poke around federated/tensorflow_federated/python/examples/simple_fedavg/; in particular, the client training loop could be modified to stop based on some criterion other than "end of dataset" (as currently used).
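As a hedged, framework-free sketch of what such a modified stopping criterion could look like (the patience threshold and loss values are invented; in simple_fedavg this logic would replace the "end of dataset" condition inside the client training loop):

```python
def should_stop(loss_history, patience=3, min_delta=1e-3):
    """Return True if the loss has not improved by at least
    min_delta within the last `patience` steps."""
    if len(loss_history) <= patience:
        return False
    best_before = min(loss_history[:-patience])
    recent_best = min(loss_history[-patience:])
    return recent_best > best_before - min_delta

# Toy client loop: each "batch" stands in for one local training
# step, and its value stands in for the loss that step produced.
losses = []
for step, batch in enumerate([0.9, 0.5, 0.4, 0.4, 0.4, 0.4, 0.2]):
    loss = batch  # stand-in for a real training step on `batch`
    losses.append(loss)
    if should_stop(losses):
        break  # early stop: the loss has plateaued
```

Here the loop stops once the loss plateaus at 0.4, before the dataset is exhausted.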

Related

Programmatic Hyperparameter Tuning for TensorFlow Object Detection API

I am using the TF Object Detection API with a custom data set, training via SLURM jobs that call the API scripts. I am looking to tune the hyperparameters found in the pipeline.config files. Unfortunately, this kind of process is not outlined in the documentation; it seems the expectation is to either use the sample configs or tune the hyperparameters by hand.
Tuning by hand is somewhat feasible: for example, adjusting two parameters (batch size and steps) over three values each results in nine different .config files, but adding another hyperparameter boosts that to twenty-seven files I need to keep track of. This does not seem like a good way to do it, particularly because it limits the values I can try and is clumsy.
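The combinatorial explosion can at least be automated. A hedged sketch that generates the variant configs from a template (the field names and value grids below are invented for illustration; a real pipeline.config would be edited through its actual protobuf fields):

```python
import itertools

# Hypothetical grid: 3 values per hyperparameter -> 3^3 = 27 configs.
grid = {
    "batch_size": [8, 16, 32],
    "num_steps": [10000, 20000, 50000],
    "learning_rate": [1e-3, 1e-4, 1e-5],
}

# Simplified stand-in for a pipeline.config snippet.
template = (
    "train_config {{\n"
    "  batch_size: {batch_size}\n"
    "  num_steps: {num_steps}\n"
    "  learning_rate: {learning_rate}\n"
    "}}\n"
)

configs = []
for values in itertools.product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    configs.append((params, template.format(**params)))

print(len(configs))  # 27 variants, one per hyperparameter combination
```

Each generated text could then be written to its own file and submitted as a separate SLURM job, instead of maintaining twenty-seven configs by hand.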
It seems like there are libraries out there that hook into Keras and other more high-level frameworks, but I have found nothing that looks like it can take the results of the Object Detection API and actually optimize it.
Is it possible to do this with a pre-built library I don't know about? I would like to avoid having to edit the API implementation or coding this myself to minimize errors.

Tensorflow Object-Detection API - Hyperparameter Tuning & Grid Search

I am currently working with the TensorFlow Object Detection API and I want to fine-tune a pre-trained model, so hyperparameter tuning is required.
Does the API already provide some kind of hyperparameter tuning (like grid search)? If not, how can I implement a simple grid search to tune the most relevant hyperparameters?
Furthermore, does the API provide some kind of early stopping function that automatically aborts the training process if the accuracy no longer improves?
Thanks a lot in advance.

TensorFlow checkpoints and models vis-a-vis multi-gpu settings

Let us take a practical situation a researcher often finds themselves in when using TensorFlow:
Multiple GPUs are available for training and I'd like to use them for speedup.
Subsequently, I'd like to give the trained model to a colleague or collaborator who has a different number of GPUs (maybe just one).
It is important that the code need not be modified when shared with multiple collaborators.
However, the TensorFlow documentation and examples are not very clear about such a scenario.
The basic questions are:
How do I write code that trains a model on multiple GPUs and lets the model be easily restored from checkpoints?
How do I deal with the situation where my collaborators have a different number of GPUs? More precisely, what best practices should I follow so that the code and model I share are easily usable by them?
Are there examples or best practices other TensorFlow users (facing the same situation) can share?
NOTE: I am not looking for a ready-made solution. My main purpose is to understand a TensorFlow feature that is not very well documented.
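For the first question, one relevant property worth illustrating is that TensorFlow checkpoints store variable values, not device placements, so the same weights can be restored on any number of GPUs. A hedged sketch of that idea with the tf.distribute API (the toy model and paths are invented for illustration; this assumes a TF version that ships MirroredStrategy):

```python
import os
import tempfile
import tensorflow as tf

def build_model():
    # Tiny stand-in model; the architecture is irrelevant here.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(4, activation="relu", input_shape=(8,)),
        tf.keras.layers.Dense(1),
    ])

# Train under a strategy that mirrors variables across all local GPUs
# (it falls back to a single device, e.g. CPU, if none are available).
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = build_model()
    model.compile(optimizer="adam", loss="mse")

# ... model.fit(...) on multiple GPUs ...
ckpt_path = os.path.join(tempfile.mkdtemp(), "ckpt")
model.save_weights(ckpt_path)

# A collaborator with a different number of GPUs (or none) restores the
# same weights without code changes: no strategy scope is needed here.
restored = build_model()
restored.load_weights(ckpt_path)
```

The key point is that only the strategy line depends on the local hardware; the model code and checkpoint are shared unchanged.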

Cloud ML Engine distributed training default type for custom tf.estimator

This article suggests there are three options for distributed training
1. Data-parallel training with synchronous updates.
2. Data-parallel training with asynchronous updates.
3. Model-parallel training.
The tutorial then goes on to suggest that the code that follows performs data-parallel training with asynchronous updates on Cloud ML Engine, behaving such that "if you distribute 10,000 batches among 10 worker nodes, each node works on roughly 1,000 batches."
However, it's not clear what portion of the code actually specifies that this is using data-parallel training with asynchronous updates. Is this simply the default for ML engine if you run it in distributed training mode with a custom tf.estimator?
The short answer is that tf.estimator is currently mostly built around Data-parallel training (2).
You get Model-parallel training simply by using with tf.device() statements in your code.
You could try to use SyncReplicasOptimizer and probably accomplish synchronous training (1).
All of the above applies generally to tf.estimator; nothing is different for CloudML Engine.
Cloud ML Engine doesn't determine the mode of distributed training; that depends on how the user sets up training with the TensorFlow libraries. In the mnist example linked from the article, the code uses TF Learn classes; specifically, an Estimator is constructed in model.py.
That code selects the optimizer, in this case AdamOptimizer, which applies updates asynchronously. If you wanted synchronous updates, you would have to use a different optimizer, such as SyncReplicasOptimizer.
For more information on how to setup synchronous training you can refer to this doc.

Real Time Object detection using TensorFlow

I have just started experimenting with deep learning and computer vision technologies. I came across this awesome tutorial, set up the TensorFlow environment using Docker, and trained my own set of objects; it gave good accuracy when I tested it out.
Now I want to make this more real-time. For example, instead of giving an image of an object as the input, I want to use a webcam and have TensorFlow recognize the object. Can you point me to the right place to start with this work?
You may want to look at TensorFlow Serving so that you can decouple compute from sensors (and distribute the computation), or at our C++ API. Beyond that, TensorFlow was written emphasizing throughput rather than latency, so batch samples as much as you can. You don't need to run TensorFlow on every frame, so input from a webcam should definitely be in the realm of possibility. Making the network smaller and buying better hardware are popular options.
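The "don't run on every frame" pattern can be sketched framework-free (the detector below is a stand-in; in a real setup you would grab frames with something like OpenCV's cv2.VideoCapture and feed them to your TensorFlow model):

```python
def throttled_inference(frames, detector, every_n=5):
    """Run the (expensive) detector only on every n-th frame and
    reuse the last result in between, trading per-frame latency
    for throughput."""
    last_result = None
    results = []
    for i, frame in enumerate(frames):
        if i % every_n == 0:
            last_result = detector(frame)
        results.append(last_result)
    return results

# Stand-in detector: pretend each frame's "detection" is its index.
frames = list(range(12))
out = throttled_inference(frames, detector=lambda f: f, every_n=5)
print(out)  # detector actually ran on frames 0, 5 and 10 only
```

With a real detector, frames skipped between inference calls simply redraw the last bounding boxes, which usually looks smooth enough for a webcam demo.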