How to combine multiple models together? - tensorflow

I am trying to "parallelize" Neural Network models to speed up training. One idea I had was to run two models on two computers and combine the results somehow.
Is this possible? If not, what are the options to parallelize model training on two computers?
I am open to using any neural network framework.

I think you mean distributed TensorFlow?
See the official documentation: https://www.tensorflow.org/deploy/distributed
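For reference, a minimal sketch of what multi-worker training across two computers can look like with the newer tf.distribute API; the hostnames, ports, TF_CONFIG values and the model below are placeholders, not something taken from the linked guide:

    # Run the same script on both machines; only the task index differs (0 or 1).
    import json
    import os
    import tensorflow as tf

    # TF_CONFIG must be set before the strategy is created.
    os.environ["TF_CONFIG"] = json.dumps({
        "cluster": {"worker": ["machine-a:12345", "machine-b:12345"]},
        "task": {"type": "worker", "index": 0},
    })

    strategy = tf.distribute.MultiWorkerMirroredStrategy()

    with strategy.scope():
        # Placeholder model; build whatever architecture you actually need here.
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
            tf.keras.layers.Dense(1),
        ])
        model.compile(optimizer="adam", loss="mse")

    # model.fit(...) then runs the same training loop on every worker,
    # with gradients synchronized across the two machines.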

Related

Serving hundreds of models with TensorFlow Serving

I would like to serve about 600 models with TensorFlow Serving.
I am trying to find a solution to eventually reduce the number of models:
My models have the same architecture; only the weights change.
Is it possible to load only one model and change the weights?
Would it be possible to aggregate all those models together so that, effectively, the first input to the model would be an ID along with the input features for that model?
Has anyone tried running a couple of hundred models on one machine? I have found this Cortex solution, but wanted to avoid using another tech.
https://towardsdatascience.com/how-to-deploy-1-000-models-on-one-cpu-with-tensorflow-serving-ec4297bff54b
If the models have the same architecture but different weights, you can try merging all those models into a "super model". However, I would need to know more about the task to see if that's possible.
To serve 600 models, you would need a very powerful machine and a lot of memory (depending on how big your models are and how much you use them in parallel).
You can either run TensorFlow Serving yourself, or use a provider such as Inferrd.com/Google/AWS.
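The "load only one model and change the weights" idea from the question could look roughly like this; a minimal sketch assuming every model really does share the architecture built below and each model's weights sit in a separate file (paths, shapes and layer sizes are made up):

    import tensorflow as tf

    def build_model():
        # The shared architecture; layer sizes here are placeholders.
        return tf.keras.Sequential([
            tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
            tf.keras.layers.Dense(1),
        ])

    model = build_model()

    def predict_for(model_id, features):
        # Swap in the weights of the requested model; the "weights/<id>.h5"
        # layout is an assumption for this sketch.
        model.load_weights(f"weights/{model_id}.h5")
        return model.predict(features)

Reloading weights on every request is slow, so in practice you would cache a few models in memory, but it shows that one graph plus many weight files is enough when the architecture never changes.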

How to connect different deep learning architectures?

Based on 5 features extracted from a sample of binary files, the idea is to combine different deep learning models, each of them processing one of the features.
Or, simply, is there a way to connect a CNN and an RNN so that the output of the CNN becomes the input of the RNN?
Any help or reference would be appreciated.
The Keras Functional API can be used to combine different deep learning models.
It is much more flexible than the Keras Sequential API, in that it supports multiple inputs and outputs.
You can implement non-linear topologies with the Functional API.
For example:
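Here is a rough sketch of feeding a CNN's output into an LSTM with the Functional API; the input shape, reshape dimensions and layer sizes are placeholders:

    import tensorflow as tf
    from tensorflow.keras import layers, Model

    inputs = layers.Input(shape=(100, 64, 1))            # placeholder input shape
    x = layers.Conv2D(32, (3, 3), activation="relu")(inputs)
    x = layers.MaxPooling2D((2, 2))(x)                   # -> (49, 31, 32)
    # Flatten the spatial feature maps into a sequence the RNN can consume.
    x = layers.Reshape((49, 31 * 32))(x)
    x = layers.LSTM(64)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)

    model = Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    model.summary()

The same pattern extends to your 5-feature case: define one Input per feature, process each with its own sub-network, then merge them with layers.Concatenate before the final layers.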

Is it possible to train a single Keras model on two different GPUs on two different systems and combine the training?

I'm using Google Colab for training my models.
But training is still slow.
So is there a way I can train from two different accounts and combine the training later?
No, you cannot train the same model using two accounts on Colab. Google Colab is meant for research purposes only, not for training large-scale production models. Colab also disconnects the kernel every 12 hours.
You can instead train the model using multiple GPUs on a single computer. Keras supports multi-GPU training when using TensorFlow as the backend. But training on two different computers/VMs is not possible this way: how would gradients flow during backpropagation?
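A minimal sketch of what single-machine multi-GPU training looks like with tf.distribute.MirroredStrategy (the model, shapes and data are placeholders):

    import tensorflow as tf

    strategy = tf.distribute.MirroredStrategy()          # uses all visible GPUs
    print("Replicas:", strategy.num_replicas_in_sync)

    with strategy.scope():
        # Placeholder model; define your real architecture here.
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(128, activation="relu", input_shape=(20,)),
            tf.keras.layers.Dense(10, activation="softmax"),
        ])
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])

    # model.fit(x_train, y_train, ...) then splits each batch across the GPUs.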
There is a solution though, but it is not an end-to-end approach. You can split your model into two different models, where the output of the first model becomes the input of the second, and the second produces the final output. For this you need a separate training set for each model.
Take this example.
Suppose you are building a face recogniser where the model takes in a raw camera picture and recognises whether a face is present (yes/no).
Instead of training this one big network, you could split it into two different nets, where the task of the first net is to crop the face and remove other useless things from the image, and the second recognises the face from the cropped image.
This is a non-end-to-end model, and you can train the two models separately on different machines with different datasets and then eventually merge them together, as sketched below. This is usually more powerful and easier to train.
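A rough sketch of merging two separately trained models so that the first one's output feeds the second; the cropper and recogniser nets below are hypothetical stand-ins for the face example, with made-up shapes:

    import tensorflow as tf
    from tensorflow.keras import layers, Model

    cropper = tf.keras.Sequential([        # trained on raw image -> cleaned/cropped image
        layers.Conv2D(16, 3, activation="relu", input_shape=(128, 128, 3)),
        layers.Conv2D(3, 3, activation="sigmoid", padding="same"),
    ], name="cropper")

    recogniser = tf.keras.Sequential([     # trained on cropped image -> yes/no
        layers.Flatten(input_shape=(126, 126, 3)),
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ], name="recogniser")

    # Merge: the cropper's output becomes the recogniser's input.
    inputs = layers.Input(shape=(128, 128, 3))
    outputs = recogniser(cropper(inputs))
    combined = Model(inputs, outputs)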
Look up this question: Tensorflow Combining Two Models End to End.
Another possibility is to ensemble the two trained models. You would have to make sure, however, that the data for both models comes from the same distribution.
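A minimal sketch of such an ensemble, averaging the two models' predictions; model_a and model_b below are placeholders for the two trained models and are assumed to share input and output shapes:

    import tensorflow as tf
    from tensorflow.keras import layers, Model

    def make_member(name):
        # Placeholder architecture standing in for a trained model.
        return tf.keras.Sequential([
            layers.Dense(32, activation="relu", input_shape=(16,)),
            layers.Dense(1, activation="sigmoid"),
        ], name=name)

    model_a = make_member("model_a")
    model_b = make_member("model_b")

    inputs = layers.Input(shape=(16,))
    avg = layers.Average()([model_a(inputs), model_b(inputs)])
    ensemble = Model(inputs, avg)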

TensorFlow checkpoints and models vis-a-vis multi-gpu settings

Let us take a practical situation a researcher often finds themselves in when using TensorFlow:
Multiple GPUs are available for training and I'd like to use them for speedup.
Subsequently, I'd like to give the trained model to a colleague or collaborator who has a different number of GPUs (maybe just one!).
It is important that the code does not need to be modified when shared with multiple collaborators.
However, the TensorFlow documentation/examples are not very clear about such a scenario.
Basic questions are:
How do I write code that trains a model with multiple GPUs and lets the model be easily restored from checkpoints?
How do I deal with the situation where my collaborators have a different number of GPUs? More precisely, what best practices should I follow to ensure that the code and model I share are easily usable by them?
Are there examples or best practices that other TensorFlow users (facing the same situation!) can share?
NOTE: I am not looking for a ready-made solution. My main purpose is to understand a TensorFlow feature which is not very well documented.
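One commonly suggested pattern (a sketch only, not a definitive answer; the model, shapes and paths are placeholders) is to build the model inside a strategy scope and save ordinary Keras weight checkpoints, which are device-agnostic and can be restored by a collaborator with any number of GPUs:

    import tensorflow as tf

    strategy = tf.distribute.MirroredStrategy()          # uses however many GPUs exist
    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(64, activation="relu", input_shape=(8,)),
            tf.keras.layers.Dense(1),
        ])
        model.compile(optimizer="adam", loss="mse")

    ckpt_cb = tf.keras.callbacks.ModelCheckpoint(
        "ckpt/model.weights.h5", save_weights_only=True)
    # model.fit(x, y, callbacks=[ckpt_cb])

    # A collaborator rebuilds the model with the same code (under whatever
    # strategy their hardware supports) and simply loads the weights:
    # model.load_weights("ckpt/model.weights.h5")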

Distributed TensorFlow clarification

Is my understanding correct that model_deploy lets the user train a model using multiple devices on a single machine? The basic premise seems to be that the clone devices share variables, and variables get distributed to parameter servers in a round-robin fashion.
On the other hand, the distributed TensorFlow framework enables the user to train a model through a cluster. A cluster lets the user train a model using multiple devices across multiple servers.
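For concreteness, my mental picture of such a cluster is something like this TF1-style sketch (hostnames and ports are placeholders):

    import tensorflow.compat.v1 as tf

    cluster = tf.train.ClusterSpec({
        "ps": ["ps0.example.com:2222"],
        "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
    })

    # Each process joins the cluster as one job/task, e.g. the first worker:
    server = tf.train.Server(cluster, job_name="worker", task_index=0)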
I think the Slim documentation is very slim, and this point has been raised a couple of times already: Configuration/Flags for TF-Slim across multiple GPU/Machines.
Thank you.