I am trying to distribute my workload across multiple GPUs with AWS SageMaker. I am using a custom algorithm for a DCGAN with TensorFlow 2.0. The code so far works perfectly on a single GPU. I decided to implement the same code with Horovod distribution across multiple GPUs to reduce run time. The code, when changed from the original to Horovod, seems to work the same, and the training time is roughly the same. However, when I print out hvd.size() I only get a size of 1, regardless of how many GPUs are present. TensorFlow recognizes all of the GPUs present; Horovod does not.
I've tried running my code both on SageMaker and on an EC2 instance in a Docker container, and the same issue persists in both environments.
Here is a link to my GitHub repo:
Here
I've also tried an entirely different neural network from the Horovod repository, updated for TF 2.0:
hvdmnist
At this point I am only trying to get the GPUs within one instance to be utilized, and am not trying to utilize multiple instances.
I think I might be missing a dependency of some sort in the Docker image, or there may be some prerequisite command I need to run. I don't really know.
Thanks.
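One thing that may be worth checking, offered as an assumption because the launch command is not shown in the post: hvd.size() reports the number of processes started by the launcher (for example `horovodrun -np 4 python train.py`, where `train.py` is a placeholder name), not the number of GPUs that are visible, so a script started with plain `python` would report a size of 1. The sketch below uses only the standard library to guess whether a launcher started the current process. It assumes that Open MPI exports `OMPI_COMM_WORLD_SIZE` and that Horovod's Gloo launcher exports `HOROVOD_SIZE`; the helper name `launcher_world_size` is hypothetical.

```python
import os

# Environment variables a launcher is expected to set (assumption: Open MPI
# exports OMPI_COMM_WORLD_SIZE, Horovod's Gloo launcher exports HOROVOD_SIZE).
_WORLD_SIZE_VARS = ("OMPI_COMM_WORLD_SIZE", "HOROVOD_SIZE")


def launcher_world_size(environ=None):
    """Return the process count advertised by a launcher, or None if none is found."""
    environ = os.environ if environ is None else environ
    for name in _WORLD_SIZE_VARS:
        value = environ.get(name)
        if value is not None and value.isdigit():
            return int(value)
    return None


if __name__ == "__main__":
    size = launcher_world_size()
    if size is None:
        print("No launcher detected: hvd.size() would be expected to report 1.")
    else:
        print(f"Launched with {size} process(es).")
```

Running this at the top of the training script, inside the same container, may help separate a launch problem from a missing dependency.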
Currently I am studying the usage of Apache Spark 3.0 with Rapids GPU Acceleration. In the official spark-rapids docs I came across this page which states:
There are cases where you may want to get access to the raw data on the GPU, preferably without copying it. One use case for this is exporting the data to an ML framework after doing feature extraction.
To me this sounds as if one could make data that is already available on the GPU from some upstream Spark ETL process directly available to a framework such as Tensorflow or PyTorch. If this is the case how can I access the data from within any of these frameworks? If I am misunderstanding something here, what is the quote exactly referring to?
The link you reference really only allows you to get access to the data still sitting on the GPU, but using that data in another framework, like TensorFlow or PyTorch, is not that simple.
TL;DR: Unless you have a library explicitly set up to work with the RAPIDS accelerator, you probably want to run your ETL with RAPIDS, then save it, and launch a new job to train your models using that data.
There are still a number of issues that you would need to solve. We have worked on these in the case of XGBoost, but it has not been something that we have tried to tackle for Tensorflow or PyTorch yet.
The big issues are:
- Getting the data to the correct process. Even if the data is on the GPU, for security reasons it is tied to a given user process. PyTorch and TensorFlow generally run as Python processes and not in the same JVM that Spark is running in. This means that the data has to be sent to the other process. There are several ways to do this, but it is non-trivial to do it as a zero-copy operation.
- The format of the data is not what TensorFlow or PyTorch want. The data for RAPIDS is in an Arrow-compatible format. TensorFlow and PyTorch have APIs for importing data in standard formats from the CPU, but it might take a bit of work to get the data into a format that the frameworks want and to find an API that lets you pull it in directly from the GPU.
- Sharing GPU resources. Spark only recently added support for scheduling GPUs. Prior to that, people would just launch a single Spark task per executor and a single Python process, so that the Python process would own the entire GPU when doing training or inference. With the RAPIDS accelerator the GPU is no longer free and you need a way to share the resources. RMM provides some of this if both libraries are updated to use it and they are in the same process, but in the case of PyTorch and TensorFlow they are typically in Python processes, so figuring out how to share the GPU is hard.
I have recently become interested in incorporating distributed training into my Tensorflow projects. I am using Google Colab and Python 3 to implement a Neural Network with customized, distributed, training loops, as described in this guide:
https://www.tensorflow.org/tutorials/distribute/training_loops
In that guide under section 'Create a strategy to distribute the variables and the graph', there is a picture of some code that basically sets up a 'MirroredStrategy' and then prints the number of generated replicas of the model, see below.
Console output
From what I can understand, the output indicates that the MirroredStrategy has only created one replica of the model, and therefore only one GPU will be used to train the model. My question: is Google Colab limited to training on a single GPU?
I have tried to call MirroredStrategy() both with and without GPU acceleration, but I only get one model replica every time. This is a bit surprising because when I use the multiprocessing package in Python, I get four threads. I therefore expected that it would be possible to train four models in parallel in Google Colab. Are there issues with TensorFlow's implementation of distributed training?
On Google Colab you can only use one GPU; that is a limit imposed by Google. However, you can run different programs on different GPU instances by creating separate Colab notebooks and connecting each of them to a GPU, but you cannot place the same model on several GPU instances in parallel.
There are no problems with MirroredStrategy; speaking from personal experience, it works fine if you have more than one GPU.
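To see how many replicas MirroredStrategy creates on a given machine, a minimal sketch along the following lines may help. It assumes TensorFlow 2.1 or later (where `tf.config.list_physical_devices` is available); the helper name `count_replicas` is hypothetical, and it returns None when TensorFlow is not installed.

```python
import importlib.util


def count_replicas():
    """Return MirroredStrategy's replica count, or None if TensorFlow is absent."""
    if importlib.util.find_spec("tensorflow") is None:
        return None
    import tensorflow as tf

    # GPUs that TensorFlow can see on this machine.
    gpus = tf.config.list_physical_devices("GPU")
    print("Visible GPUs:", len(gpus))

    # With no arguments, MirroredStrategy uses all visible GPUs
    # (or falls back to the CPU when there are none).
    strategy = tf.distribute.MirroredStrategy()
    print("Number of replicas:", strategy.num_replicas_in_sync)
    return strategy.num_replicas_in_sync


if __name__ == "__main__":
    count_replicas()
```

If the replica count matches the number of visible GPUs, the strategy is behaving as intended and the limit comes from the runtime rather than from TensorFlow.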
I have a system with two GPUs, and am using Keras with Tensorflow backend. Gpu:0 is being allocated to PyCUDA, which is performing a unique operation which is fed forward to Keras, and changes with each batch. As such, I would like to run a Keras model on gpu:1 while leaving gpu:0 allocated to PyCUDA.
Is there any way to do this? Looking through prior threads I've found several deprecated solutions.
So I don't think that this feature is meaningfully implemented in Keras currently. I found a workaround that I recommend: create multiple processes using Python's standard multiprocessing library.
Note: Currently for this setup you need to spawn the new process, rather than fork it, to avoid a weird interaction with one of the PyCUDA backend libraries.
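A minimal sketch of that workaround, under some assumptions: the child process hides gpu:0 from TensorFlow by setting `CUDA_VISIBLE_DEVICES` before Keras is imported, which is a common way to pin a process to one GPU but is not stated in the answer above. The function names `keras_worker` and `run_on_gpu` are hypothetical, and the Keras work itself is left as a placeholder comment.

```python
import multiprocessing as mp
import os


def keras_worker(gpu_id, batch, results):
    """Run in a child process that should only see the GPU given by gpu_id."""
    # Must be set before any CUDA-using library is imported in this process.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    # Placeholder: import Keras here, build or load the model, and run it
    # on `batch`. Only the visible device and batch size are reported back.
    results.put((os.environ["CUDA_VISIBLE_DEVICES"], len(batch)))


def run_on_gpu(gpu_id, batch):
    """Spawn (not fork) a worker restricted to one GPU and return its result."""
    ctx = mp.get_context("spawn")
    results = ctx.Queue()
    proc = ctx.Process(target=keras_worker, args=(gpu_id, batch, results))
    proc.start()
    out = results.get()  # read before join so the queue is drained
    proc.join()
    return out
```

Because the spawn start method re-imports the main module in the child, a call such as `run_on_gpu(1, batch)` needs to sit under an `if __name__ == "__main__":` guard in the script that uses it. The parent process can keep gpu:0 for PyCUDA and pass each batch to the worker.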
I'm trying to train my model (which is not built with tf.estimator or tf.keras) using a distributed training job in ML Engine.
What steps should I take in order to run a distributed training job in ML Engine?
I found the following guidelines:
- provide the --scale-tier parameter, from the step-by-step guide
- use the distribution strategy API in the code, from recent Google I/O talks
So if the former is provided on the command line, does that mean I don't need to do anything with the latter, because ML Engine somehow takes care of distributing my graph across devices? Or do I need to do both?
And also, what happens if I manually specify devices using:
with tf.device('/gpu:0/1/2/etc')
..and then run the command with --scale-tier?
There are two possible scenarios:
- You want to use machines with CPU:
In this case, you are right. Using the --scale-tier parameter is enough to have a job that is distributed automatically in ML Engine.
You have several scale-tier options {1}.
- You want to use machines with GPU:
In this case, you have to define a config.yaml file that describes the GPU options you want and run a gcloud command to launch the ML Engine job with config.yaml as a parameter {2}.
If you use with tf.device('/gpu:0/1/2/etc') inside your code, you are forcing the use of that device, and it overrides the normal behavior {3}.
{1}: https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs#scaletier
{2}: https://cloud.google.com/ml-engine/docs/tensorflow/using-gpus#requesting_gpu-enabled_machines
{3}: https://www.tensorflow.org/programmers_guide/using_gpu
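Whichever scale tier is chosen, each machine in the job can find out its role in the cluster from the `TF_CONFIG` environment variable, which ML Engine is documented to set as JSON with `cluster` and `task` entries. The following is a hedged sketch that parses it with the standard library; the function name `read_tf_config` and the sample host names are hypothetical.

```python
import json
import os


def read_tf_config(environ=None):
    """Parse TF_CONFIG into (cluster, task_type, task_index).

    Returns an empty cluster and a single 'master' task when TF_CONFIG is
    unset, which is how a non-distributed run would look.
    """
    environ = os.environ if environ is None else environ
    config = json.loads(environ.get("TF_CONFIG", "{}"))
    cluster = config.get("cluster", {})
    task = config.get("task", {})
    return cluster, task.get("type", "master"), int(task.get("index", 0))


if __name__ == "__main__":
    # Hypothetical sample value, for illustration only.
    sample = {
        "TF_CONFIG": json.dumps({
            "cluster": {
                "master": ["host0:2222"],
                "worker": ["host1:2222", "host2:2222"],
            },
            "task": {"type": "worker", "index": 1},
        })
    }
    cluster, task_type, task_index = read_tf_config(sample)
    print(task_type, task_index, len(cluster["worker"]))
```

The parsed cluster dictionary has the shape that a cluster specification expects, so code that is not built on tf.estimator could use it to decide which devices to place operations on.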
I'm trying to build a distributed TensorFlow framework template, but several problems confuse me.
1. When I use --sync_replas=True in the script, does it mean I am using synchronous training as described in the docs?
2. Why does the global step in worker_0.log and worker_1.log not increment successively?
3. Why does the global step not start at 0, but instead look like this?
1499169072.773628: Worker 0: training step 1 done (global step: 339)
4. What's the relation between the training step and the global step?
5. As you can see from the create cluster script, I created an independent cluster. Can I run multiple different models on this cluster at the same time?
1. Probably, but it depends on the particular library.
2. During distributed training it's possible to have race conditions, so the increments and reads of the global step are not fully ordered. This is fine.
3. This is probably because you're loading from a checkpoint?
4. Unclear; it depends on the library you're using.
5. One model per cluster is much easier to manage. It's fine to create multiple tf clusters on the same set of machines, though.
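The difference between a worker's own training step and the shared global step can be illustrated with a toy simulation. This is plain Python threading, not TensorFlow, and the class and function names are hypothetical. Each worker counts its own steps while all workers increment one shared counter, so the global values that any single worker observes need not be consecutive.

```python
import threading


class GlobalStep:
    """A shared counter standing in for the global step variable."""

    def __init__(self, start=0):
        self._value = start
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:
            self._value += 1
            return self._value

    @property
    def value(self):
        with self._lock:
            return self._value


def run_worker(worker_id, num_steps, global_step, log):
    """Perform num_steps local steps, recording the global step seen after each."""
    for local_step in range(1, num_steps + 1):
        seen = global_step.increment()
        log.append((worker_id, local_step, seen))


def simulate(num_workers=2, num_steps=5, start=0):
    """Run the workers concurrently and return (final global step, per-worker logs)."""
    global_step = GlobalStep(start)
    logs = [[] for _ in range(num_workers)]
    threads = [
        threading.Thread(target=run_worker, args=(i, num_steps, global_step, logs[i]))
        for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return global_step.value, logs
```

Calling `simulate(start=338)` mimics resuming from a checkpoint: a worker's first local step then reports a global step of 339 or higher rather than 1, and the final global step equals the starting value plus the total number of local steps across all workers.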