Simplest way to distribute Tensorflow training on premise? - tensorflow

What is the simplest way to train TensorFlow models (using the Estimator API) distributed across a home network? It doesn't look like ml-engine local train allows you to specify IPs.

Your best bet is to use something like Kubernetes. The tensorflow/k8s project is a work in progress, but I believe it supports distributed training as well: https://github.com/tensorflow/k8s.
Alternatively, for more low-tech automation, these options come to mind:
- Have a script that uses SSH to execute the training script remotely on each machine.
- Have the individual workers poll a shared location for a file that acts as a signal to download and execute a script.
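The SSH option could look roughly like the sketch below. The host names and the script path are hypothetical placeholders, and the sketch only builds and prints the commands; launching them (for example with subprocess.Popen) is left as a comment.

```python
import shlex

def build_ssh_commands(hosts, remote_command):
    """Return one argv list per host that runs remote_command over SSH."""
    return [["ssh", host, remote_command] for host in hosts]

# Hypothetical machines on the home network and a hypothetical script path.
HOSTS = ["user@192.168.1.10", "user@192.168.1.11"]
REMOTE_COMMAND = "python3 " + shlex.quote("/home/user/train.py")

commands = build_ssh_commands(HOSTS, REMOTE_COMMAND)
for argv in commands:
    # Print the command as it would be typed in a shell.
    print(" ".join(shlex.quote(part) for part in argv))
    # To actually start the job: subprocess.Popen(argv)
```

This assumes key-based SSH login is already set up on every machine, so the script can run unattended.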

You can set the environment variable TF_CONFIG on each machine. Estimators parse it to find out which machines are in the cluster and which role the current machine plays.
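As a minimal sketch, TF_CONFIG is a JSON string with a "cluster" section listing every machine and a "task" section identifying the current one. The addresses, ports and the helper name make_tf_config below are assumptions for illustration; check the key names against the TensorFlow version you use.

```python
import json
import os

def make_tf_config(chief, workers, task_type, task_index, ps=None):
    """Build the JSON string expected in the TF_CONFIG environment variable."""
    cluster = {"chief": [chief], "worker": list(workers)}
    if ps:
        cluster["ps"] = list(ps)
    return json.dumps({
        "cluster": cluster,
        "task": {"type": task_type, "index": task_index},
    })

# Hypothetical home-network addresses. Every machine uses the same cluster
# description and only changes its own task type and index.
os.environ["TF_CONFIG"] = make_tf_config(
    chief="192.168.1.10:2222",
    workers=["192.168.1.11:2222", "192.168.1.12:2222"],
    task_type="worker",
    task_index=0,
)
print(os.environ["TF_CONFIG"])
```

The variable has to be set before the Estimator (or its RunConfig) is created, since that is when it is read.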

Related

ML model serving with great developer ergonomics

We are looking for ML model serving with a developer experience where the ML engineers don’t need to know DevOps.
Ideally we are looking for the following ergonomics or something similar:
1. Initialize a new model serving endpoint, preferably via a CLI, and get a GCS bucket.
2. Each time we train a new model, we put it in the GCS bucket from step 1.
3. The serving system guarantees that the most recent model in the bucket is served unless a model is specified by version number.
We are also looking for a service that optimizes cost and latency.
Any suggestions?
Have you considered TensorFlow Serving (https://www.tensorflow.org/tfx/serving/architecture)? You can definitely automate the entire workflow using TFX, and I think the guide here does a good job of walking through it. Depending on your use case, you may want to use tft instead of Kubeflow as they do in that guide.
Besides serving automation, you may also want to consider pipeline automation that separates the feature engineering from the pipeline mechanics. For example, you can build the pipeline, abstract the feature engineering into a TensorFlow function that meets certain requirements, and automate the deployment process as well. That way you don't need to deal with the feature specs/schemas manually, and you know that your transformations are the same during serving as they were during training.
You can do the same thing with scikit-learn, and I believe serving scikit-learn models is also supported under the Vertex AI umbrella.
To your point about latency, you will want the pipeline to do the transformations on the GPU, so I would recommend TensorFlow over something like scikit-learn if the use case is truly time-sensitive.
Best of luck!
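The "serve the most recent model unless a version is specified" requirement from the question can be sketched as a small selection rule. This assumes the bucket holds one numbered subdirectory per model version, which is the layout TensorFlow Serving expects under a model base path; the function name pick_version and the directory names are hypothetical.

```python
def pick_version(version_dirs, requested=None):
    """Choose which model version to serve.

    version_dirs: names of the subdirectories found under the model path.
    requested: an explicit version number, or None for the most recent one.
    """
    # Ignore anything that is not a plain number, such as temporary folders.
    versions = sorted(int(name) for name in version_dirs if name.isdigit())
    if not versions:
        raise ValueError("no numbered model versions found")
    if requested is None:
        return versions[-1]
    if requested not in versions:
        raise KeyError(f"version {requested} is not available")
    return requested

# Hypothetical bucket listing.
listing = ["1", "2", "10", "tmp"]
print(pick_version(listing))
print(pick_version(listing, requested=2))
```

Comparing versions as integers rather than strings matters here, because "10" sorts before "2" as text.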

Running TensorFlow Extended (TFX) on AWS

I was wondering whether it is possible, and how easy it would be, to implement a TFX pipeline on AWS for a real dataset (100+ GB, not a tutorial with a small dataset).
For orchestration, I might use Kubeflow. I suppose the major issue would be setting up a properly scalable runner for Apache Beam; I am thinking of using Apache Flink for that.
Does anyone have experience doing this? More generally, how would you put a TF model into production on AWS when it needs to be retrained regularly on new data: do you write the pipeline from scratch or use some tool?

Correct way to run distributed training in ML Engine

I'm trying to train my model (which is not built with tf.estimator or tf.keras) using a distributed training job in ML Engine.
What steps should I take in order to run a distributed training job in ML Engine?
I found the following guidelines:
- provide the --scale-tier parameter, from the step-by-step guide
- use the distribution strategy API in the code, from recent Google I/O talks
So if the former is provided on the command line, does that mean I don't need to do anything about the latter, because ML Engine somehow takes care of distributing my graph across devices? Or do I need to do both?
Also, what happens if I manually specify devices using:
with tf.device('/gpu:0/1/2/etc')
...and then run the command with --scale-tier?
There are two possible scenarios:
- You want to use machines with CPU:
In this case, you are right. Using the --scale-tier parameter is enough to get a job that is distributed automatically in ML Engine.
There are several scale-tier options {1}.
- You want to use machines with GPU:
In this case, you have to define a config.yaml file that describes the GPU options you want, and then run a gcloud command that launches the ML Engine job with config.yaml as a parameter {2}.
If you use with tf.device('/gpu:0/1/2/etc') inside your code, you are forcing the use of that device, which overrides the normal placement behavior {3}.
{1}: https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs#scaletier
{2}: https://cloud.google.com/ml-engine/docs/tensorflow/using-gpus#requesting_gpu-enabled_machines
{3}: https://www.tensorflow.org/programmers_guide/using_gpu
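As a rough illustration of the config.yaml mentioned for the GPU case, the sketch below generates a small file body from a Python dictionary. The field names follow the trainingInput section described in {2}; the machine types and counts are example values only, and the helper to_yaml is a hypothetical minimal emitter that handles only nested dictionaries of plain values.

```python
def to_yaml(mapping, indent=0):
    """Render a nested dict of scalars as a list of YAML lines."""
    lines = []
    pad = "  " * indent
    for key, value in mapping.items():
        if isinstance(value, dict):
            lines.append(f"{pad}{key}:")
            lines.extend(to_yaml(value, indent + 1))
        else:
            lines.append(f"{pad}{key}: {value}")
    return lines

# Example values only; pick machine types and counts that fit your job.
config = {
    "trainingInput": {
        "scaleTier": "CUSTOM",
        "masterType": "standard_gpu",
        "workerType": "standard_gpu",
        "parameterServerType": "standard",
        "workerCount": 2,
        "parameterServerCount": 1,
    }
}

config_text = "\n".join(to_yaml(config)) + "\n"
print(config_text)
```

The printed text would be saved as config.yaml and passed to the job submission command through its --config flag.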

Already implemented neural network on Google Cloud Platform

I have implemented a neural network model using Python and TensorFlow, which normally runs on my own computer.
Now I would like to train it on new datasets on the Google Cloud Platform. Do you think it is possible? Do I need to change my code?
Thank you very much for your help!
Google Cloud offers the Cloud ML Engine service, which lets you train your models and perform predictions without having to run and maintain an instance with the required software.
To run the TensorFlow NN models you already have, you will not need to change your code. You will only have to package the trainer appropriately, as described in the documentation, and run an ML Engine job that performs the training itself. Once you have your model, you can also deploy it in the same service and later get predictions with different features depending on your requirements (urgency in getting the predictions, data set sources, etc.).
Alternatively, as suggested in the comments, you can always launch a Compute Engine instance and run your TensorFlow model there as if you were running it locally on your computer. However, I would strongly recommend the approach I proposed earlier: you will save some money, because you are only charged for your usage (training jobs and/or predictions), and you do not need to configure an instance from scratch.
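Packaging the trainer mostly means giving it an entry-point module that takes its settings from command-line flags instead of hard-coded local paths. A minimal sketch, assuming the job passes a --job-dir flag for outputs; the --train-files and --epochs flags and the bucket paths are hypothetical names chosen for this example:

```python
import argparse

def parse_args(argv=None):
    """Parse the flags a packaged trainer module might receive."""
    parser = argparse.ArgumentParser(description="Trainer entry point (sketch)")
    parser.add_argument("--job-dir", required=True,
                        help="Where checkpoints and exported models are written")
    # Hypothetical flags, for illustration only.
    parser.add_argument("--train-files", nargs="+", default=[],
                        help="Paths of the training data files")
    parser.add_argument("--epochs", type=int, default=10)
    return parser.parse_args(argv)

# Example invocation with hypothetical bucket paths.
args = parse_args([
    "--job-dir", "gs://my-bucket/run1",
    "--train-files", "gs://my-bucket/data/train.csv",
])
print(args.job_dir, args.train_files, args.epochs)
```

The existing training code can then be called with these values, so the same module runs unchanged on a laptop and in the cloud job.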

Unable to use Tensorboard in Distributed Tensorflow

I am starting to play with Distributed TensorFlow. I am able to distribute the training across different servers successfully, but I cannot see any summaries in TensorBoard.
Does anyone know if there are any limitations or caveats with this?
Thanks
There is one caveat: TensorBoard doesn't support replicated summary writers. Other than that, it will work.
Choose one TensorFlow worker to be the summary writer, and have it write summaries to disk. Then launch TensorBoard pointing at the summary files you've saved. The simplest option is to launch TensorBoard on the same server that the summary worker is on; alternatively, you could copy the files off that server onto your own machine.
Note that in the special case where you are using Google Cloud, TensorBoard can read directly from GCS paths.
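One way to choose the single summary-writing worker is to derive it from the cluster description each process already has. The sketch below assumes the processes are configured through a TF_CONFIG-style JSON string; the function name should_write_summaries and the rule of falling back to worker 0 when the cluster has no chief are assumptions for this example.

```python
import json
import os

def should_write_summaries(tf_config_json):
    """Return True for exactly one task in the cluster described by TF_CONFIG."""
    config = json.loads(tf_config_json or "{}")
    task = config.get("task", {})
    cluster = config.get("cluster", {})
    task_type = task.get("type")
    if not task_type:
        # No cluster information: a single local process writes its own summaries.
        return True
    if task_type in ("chief", "master"):
        return True
    has_chief = "chief" in cluster or "master" in cluster
    # Without a dedicated chief, let worker 0 be the writer.
    return task_type == "worker" and int(task.get("index", 0)) == 0 and not has_chief

write_summaries = should_write_summaries(os.environ.get("TF_CONFIG"))
print("this process writes summaries:", write_summaries)
```

Only the process for which this returns True would create a summary writer; the others skip it, so TensorBoard sees a single set of event files.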