I am starting to play with Distributed TensorFlow. I am able to distribute the training across different servers successfully, but I cannot see any summaries in TensorBoard.
Does anyone know if there are any limitations or caveats with this?
Thanks
There is a caveat, which is that TensorBoard doesn't support replicated summary writers. Other than that, it will work.
Choose one TensorFlow worker to be the summary writer, and have it write summaries to disk. Then, launch TensorBoard pointing to the summary files that you've saved (the simplest would be to launch TensorBoard on the same server that the summary worker is on - alternatively, you could copy the files off that server onto your machine, etc).
Note that, in the special case where you are using Google Cloud, TensorBoard can read directly from GCS paths.
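Here is a minimal TF 1.x-style sketch of that single-writer pattern; `job_name`, `task_index`, and the log directory are placeholders that would normally come from your cluster spec or command-line flags:

```python
import tensorflow as tf

# Placeholder values; in a real distributed job these come from your cluster spec / flags.
job_name = 'worker'
task_index = 0
logdir = '/tmp/train_logs'

# Only one task (the "chief" worker) writes summaries to disk.
is_chief = (job_name == 'worker' and task_index == 0)

loss = tf.constant(0.5)            # stand-in for your real loss tensor
tf.summary.scalar('loss', loss)
merged = tf.summary.merge_all()

with tf.Session() as sess:
    writer = tf.summary.FileWriter(logdir, sess.graph) if is_chief else None
    summary_str = sess.run(merged)
    if writer:
        writer.add_summary(summary_str, global_step=0)
        writer.close()

# Then point TensorBoard at the chief's log directory:
#   tensorboard --logdir=/tmp/train_logs
```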
I am trying to distribute my workload to multiple GPUs with AWS SageMaker. I am using a custom algorithm for a DCGAN with TensorFlow 2.0. The code thus far works perfectly on a single GPU. I decided to implement the same code but with Horovod distribution across multiple GPUs to reduce run time. The code, when changed from the original to Horovod, seems to work the same, and the training time is roughly the same. However, when I print out hvd.size() I am only getting a size of 1, regardless of the multiple GPUs present. TensorFlow recognizes all the present GPUs; Horovod does not.
I've tried running my code on both Sagemaker and on an EC2 instance in a docker container, and in both environments the same issue persists.
Here is a link to my GitHub repo:
Here
I've also tried using an entirely different neural network from the Horovod repository, updated to TF 2.0:
hvdmnist
At this point I am only trying to get the GPUs within one instance to be utilized, and am not trying to utilize multiple instances.
I think I might be missing a dependency of some sort in the Docker image, or perhaps there is some prerequisite command I need to run; I don't really know.
Thanks.
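For reference (this is not necessarily what the linked repo does), the usual TF 2.0 + Horovod setup initializes Horovod and pins one process to each GPU. Note that hvd.size() reports the number of launched processes, so it will stay at 1 unless the script is started with one process per GPU, e.g. horovodrun -np 4 python train.py or the equivalent mpirun invocation:

```python
import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()

# Pin this process to a single GPU (one process per GPU).
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

# With a plain `python train.py` launch this prints size 1;
# with `horovodrun -np 4 python train.py` it prints size 4.
print('Horovod size:', hvd.size(), 'local rank:', hvd.local_rank())
```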
I'm trying to train my model (which is not built with tf.estimator or tf.keras) using a distributed training job in ML Engine.
What steps should I take in order to run a distributed training job in ML Engine?
I found the following guidelines:
provide the --scale-tier parameter, from the step-by-step guide
use the distribution strategy API in the code, from recent Google I/O talks
So if the former is provided on the command line, does it mean I don't need to do anything with the latter, because ML Engine somehow takes care of distributing my graph across devices? Or do I need to do both?
And also, what happens if I manually specify devices using:
with tf.device('/gpu:0/1/2/etc')
...and then run the command with --scale-tier?
There are two possible scenarios:
- You want to use machines with CPU:
In this case, you are right. Using the --scale-tier parameter is enough to have a job that is distributed automatically in ML Engine.
You have several scale-tier options {1}.
- You want to use machines with GPU:
In this case, you have to define a config.yaml file that describes the GPU options you want and run a gcloud command to launch the ML Engine job with config.yaml as a parameter {2}.
If you use with tf.device('/gpu:0/1/2/etc') inside your code, you are forcing the use of that device, and it overrides the normal placement behavior {3}.
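As a small illustration of that last point (a TF 1.x-style sketch, not taken from the question's code), tf.device forces an op onto the named device, while allow_soft_placement lets TensorFlow fall back to another device if the requested one is not available on that machine:

```python
import tensorflow as tf

# Pinning ops to a specific device overrides TensorFlow's automatic placement.
with tf.device('/gpu:0'):
    a = tf.constant([1.0, 2.0], name='a')
    b = tf.constant([3.0, 4.0], name='b')
    c = a + b

# allow_soft_placement=True lets the op run on CPU if no GPU is present;
# log_device_placement=True prints where each op actually ran.
config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
with tf.Session(config=config) as sess:
    print(sess.run(c))
```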
{1}: https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs#scaletier
{2}: https://cloud.google.com/ml-engine/docs/tensorflow/using-gpus#requesting_gpu-enabled_machines
{3}: https://www.tensorflow.org/programmers_guide/using_gpu
What is the simplest way to train TensorFlow models (using the Estimator API) distributed across a home network? It doesn't look like ml-engine local train allows you to specify IPs.
Your best bet is to use something like Kubernetes. This is a work in progress, but I believe it does have support for distributed training as well -- https://github.com/tensorflow/k8s.
Alternatively, for more low-tech automation options, these come to mind...
You could have a script which still uses SSH and executes a script remotely.
You could have the individual workers poll a shared location for a file to use as a signal to download and execute a script.
You can set the environment variable TF_CONFIG, which will be parsed by Estimators; a rough sketch of what that looks like is shown below.
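This is a minimal sketch of the TF_CONFIG route, assuming two machines on a home network; the IPs, ports, and roles are placeholders, and each machine sets its own task entry before starting the Estimator:

```python
import json
import os

# Hypothetical two-machine cluster; replace addresses with your own machines.
cluster = {
    'chief': ['192.168.1.10:2222'],
    'worker': ['192.168.1.11:2222'],
}

# Each machine describes the full cluster plus its own role in it.
os.environ['TF_CONFIG'] = json.dumps({
    'cluster': cluster,
    'task': {'type': 'chief', 'index': 0},  # on the second machine: {'type': 'worker', 'index': 0}
})

# tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
# reads TF_CONFIG and joins the cluster automatically.
```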
I have implemented a neural network model using Python and TensorFlow, which normally runs on my own computer.
Now I would like to train it on new datasets on the Google Cloud Platform. Do you think it is possible? Do I need to change my code?
Thank you very much for your help!
Google Cloud offers the Cloud ML Engine service, which allows you to train your models and perform predictions without having to run and maintain an instance with the required software.
In order to run the TensorFlow NN models you already have, you will not need to change your code; you will only have to package the trainer appropriately, as described in the documentation, and run an ML Engine job that performs the training itself. Once you have your model, you can also deploy it in the same service and later get predictions with different features depending on your requirements (urgency in getting the predictions, data set sources, etc.).
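As a rough illustration of the packaging step (names here are placeholders; follow the packaging documentation for the exact layout), the trainer is typically a small Python package, e.g. trainer/__init__.py and trainer/task.py, with a setup.py next to it:

```python
# setup.py -- minimal trainer package for a Cloud ML Engine training job.
from setuptools import find_packages, setup

setup(
    name='trainer',                      # placeholder package name
    version='0.1',
    packages=find_packages(),            # picks up the trainer/ package
    install_requires=['tensorflow'],     # plus any other dependencies your model needs
    description='NN model packaged for a Cloud ML Engine training job',
)
```

The training job is then submitted with gcloud, pointing at this package and its entry-point module.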
Alternatively, as suggested in the comments, you can always launch a Compute Engine instance and run your TensorFlow model there as if you were doing it locally on your computer. However, I would strongly recommend the approach I proposed earlier, as you will save some money: you will only be charged for your usage (training jobs and/or predictions) and will not need to configure an instance from scratch.
I'm actually new to machine learning, but this topic is very interesting to me, so I'm using TensorFlow to classify some images from the MNIST dataset. I run this code on a Compute Engine VM in Google Cloud, because my computer is too weak for it. The code actually runs well, but the problem is that each time I log in to my VM and run the same code, I have to wait while my CNN model trains before I can run some tests, experiment with my data, plot results, or import some external images to improve my accuracy, etc.
Is there some way to save the result of training my model just once, somewhere, so that when I decide, for example, to log in to the same VM tomorrow, I don't have to wait while my model trains again? Is it possible to do this?
Or is there maybe another way to do something similar?
You can save a trained model in TensorFlow and then load it later; that way you only have to train your model once and can use it as many times as you want. To do that, you can follow the TensorFlow documentation on that topic, where you can find information on how to save and load the model. In short, you will have to use the SavedModelBuilder class to define the type and location of your saved model, and then add the MetaGraphs and variables you want to save. Loading the saved model for later use is even easier, as you will only have to run a command pointing to the directory to which the model was exported.
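A minimal end-to-end sketch of that save-then-reload flow (TF 1.x API, with a deliberately trivial graph; your MNIST CNN and paths will differ) could look like this:

```python
import tensorflow as tf

export_dir = '/tmp/saved_model_demo'  # must not already exist; SavedModelBuilder creates it

# Build a (trivial) graph standing in for your trained CNN, then export it once.
x = tf.placeholder(tf.float32, shape=[None, 1], name='x')
w = tf.Variable([[2.0]], name='w')
y = tf.matmul(x, w, name='y')

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # ... training would happen here ...
    builder = tf.saved_model.builder.SavedModelBuilder(export_dir)
    builder.add_meta_graph_and_variables(sess, [tf.saved_model.tag_constants.SERVING])
    builder.save()

# In a later session (e.g. tomorrow's VM login), reload instead of retraining.
with tf.Session(graph=tf.Graph()) as sess:
    tf.saved_model.loader.load(sess, [tf.saved_model.tag_constants.SERVING], export_dir)
    x_restored = sess.graph.get_tensor_by_name('x:0')
    y_restored = sess.graph.get_tensor_by_name('y:0')
    print(sess.run(y_restored, feed_dict={x_restored: [[3.0]]}))  # -> [[6.0]]
```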
On the other hand, I would strongly recommend you change your working environment in a way that is more profitable for you. In Google Cloud you have the Cloud ML Engine service, which might be a good fit for the type of work you are doing. It allows you to train your models and perform predictions without needing an instance running all the required software. I happen to have worked a little with TensorFlow recently, and at first I was also working with a virtualized instance, but after following some tutorials I was able to save some money by migrating my work to ML Engine, as you are only charged for usage. If you are using your VM only for that purpose, take a look at it.
You can of course consult all the available documentation, but as a first quickstart, if you are interested in ML Engine, I recommend having a look at how to train your models and how to get your predictions.