How to run Tensorflow on SLURM cluster with properly configured parameter server? - tensorflow

I am in the fortunate position of having access to my university's SLURM-powered GPU cluster. I have been trying to get Tensorflow to run in a cluster node, but so far I have failed to find any documentation. (Everyone I have spoken to at the university has run it using CPU nodes before, or using a single GPU node.)
I found an excellent bit of documentation in this previous question here. Unfortunately, it's rather incomplete. All of the other distributed examples I have found, such as this one, rely on explicitly specifying the parameter server.
When I try to run it using the code from the SO question, it appears to work perfectly until it either fails to connect to a nonexistent parameter server or hangs when server.join is called, with no printouts written to the sbatch outfile (which I understand should happen).
So in short, my question is: how would one go about starting Tensorflow on a SLURM cluster, from the sbatch stage onwards? This is my first time dealing with a distributed computing framework besides Spark on AWS, and I would love to learn more about how to properly configure Tensorflow. For example, how do I specify which of the items in tf_hostlist should serve as the parameter server? Alternatively, can I use sbatch to send slightly different commands to each worker, as I have seen done in other examples?
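For concreteness, here is a minimal sketch of one way this is commonly wired up (assuming one task per node launched with srun from the sbatch script, the TF 1.x distributed API, an arbitrary port, and the arbitrary choice of task 0 as the parameter server; adapt to your cluster):

```python
# Hypothetical sketch: build a tf.train.ClusterSpec from SLURM environment
# variables. Assumes the sbatch script runs this with "srun python train.py"
# so that SLURM_PROCID is set per task, and uses task 0 as the parameter server.
import os
import subprocess
import tensorflow as tf

PORT = 2222  # arbitrary free port allowed on your cluster

# Expand "node[01-03]"-style node lists into individual hostnames.
nodes = subprocess.check_output(
    ["scontrol", "show", "hostnames", os.environ["SLURM_JOB_NODELIST"]]
).decode().split()

proc_id = int(os.environ["SLURM_PROCID"])          # this task's rank
hosts = ["{}:{}".format(n, PORT) for n in nodes]

# First host acts as the parameter server, the rest are workers.
cluster = tf.train.ClusterSpec({"ps": hosts[:1], "worker": hosts[1:]})

if proc_id == 0:
    server = tf.train.Server(cluster, job_name="ps", task_index=0)
    server.join()                                  # PS blocks here, serving variables
else:
    server = tf.train.Server(cluster, job_name="worker", task_index=proc_id - 1)
    # Build the graph under tf.device(tf.train.replica_device_setter(cluster=cluster))
    # and run training against server.target.
```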

Related

How to define multiple gres resources in SLURM using the same GPU device?

I'm running machine learning (ML) jobs that make use of very little GPU memory.
Thus, I could run multiple ML jobs on a single GPU.
To achieve that, I would like to add multiple lines in the gres.conf file that specify the same device.
However, it seems the slurm daemon doesn't accept this; the service returns:
fatal: Gres GPU plugin failed to load configuration
Is there any option I'm missing to make this work?
Or maybe a different way to achieve that with SLURM?
It is kind of similar to this one, but that one seems specific to some CUDA code with compilation enabled, which seems way more specific than my general case (or at least as far as I understand).
How to run multiple jobs on a GPU grid with CUDA using SLURM
I don't think you can oversubscribe GPUs, so I see two options:
You can configure the CUDA Multi-Process Service or
pack multiple calculations into a single job that has one GPU and run them in parallel.
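For the second option, a rough sketch of what the job script could launch (the training commands and the number of parallel runs are placeholders; all processes share the single GPU allocated to the job):

```python
# Sketch of the "packing" option: one sbatch job gets one GPU, and this
# script, run inside that job, starts several small trainings concurrently.
import subprocess
from concurrent.futures import ThreadPoolExecutor

commands = [
    ["python", "train.py", "--config", "run_a.yaml"],   # hypothetical runs
    ["python", "train.py", "--config", "run_b.yaml"],
    ["python", "train.py", "--config", "run_c.yaml"],
]

def run(cmd):
    # All child processes inherit the same CUDA_VISIBLE_DEVICES from the job,
    # so they share the single allocated GPU.
    return subprocess.run(cmd, check=True)

with ThreadPoolExecutor(max_workers=len(commands)) as pool:
    list(pool.map(run, commands))
```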
Besides NVIDIA MPS mentioned by @Marcus Boden, which is relevant for V100-type cards, there is also Multi-Instance GPU (MIG), which is relevant for A100-type cards.

TensorFlow 2.0 - How to create a worker cluster?

I'm new to Tensorflow, and I want to perform distributed computing/training using different machines.
The tutorial in this link mentions:
In practice, users would create multiple workers on external IP addresses/ports, and set TF_CONFIG on each worker appropriately.
I didn't find anything that tells how to do that.
I did find tutorials that used an old version of TensorFlow, but there was no TF_CONFIG there and I don't see any ClusterSpec used in the example, so I'm very confused.
Turns out the answer was simpler than expected.
Set TF_CONFIG on all machines and then run the same script on all machines; the cluster definition in TF_CONFIG is identical everywhere, and only the task index differs per worker.
The training does not start until all nodes/workers are connected.
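For illustration (the host names and ports below are made up), TF_CONFIG is a JSON string: the "cluster" part is the same on every machine, and the "task" index identifies which worker the current machine is:

```python
# Hypothetical two-worker setup; host names and ports are placeholders.
# Both machines run the same script, differing only in "index".
import json
import os

os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "worker": ["machine-a.example.com:12345", "machine-b.example.com:12345"]
    },
    "task": {"type": "worker", "index": 0}   # use index 1 on the second machine
})

# The strategy reads TF_CONFIG when it is constructed, before building the model.
import tensorflow as tf
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
```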

Running gradient-free optimization methods in parallel with OpenMDAO and PyOptSparse

I would like to run ALPSO and NSGA2 from OpenMDAO using the PyOptSparse driver in parallel. The catch is that I don't want to run the model itself in parallel (which I have done frequently in OpenMDAO); I just want to run the optimization computations in parallel (e.g., distribute the calculations for the swarm members of ALPSO). I have been looking through the documentation and source for all of the above-mentioned codes, but I have not found a way to do this. Could someone point me in the right direction?
Note: I am currently using OpenMDAO 1.7.3, but I am open to answers involving later versions
I don't believe that those optimizers support parallel execution. It would most likely require modifications to the code in ALPSO/NSGA2, pyoptsparse, and the pyoptsparse driver to support this.
In OpenMDAO 2.2 (the latest version), we do have a simple GA driver that can run the evaluation of points in the population in parallel, so maybe that is an option. (It is pretty simple, though, and only supports a single objective.)
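A hedged sketch of that option, using the OpenMDAO 2.x API from memory (the option names, in particular 'run_parallel', should be checked against the docs for your version; the toy problem below is just an ExecComp stand-in for a real model):

```python
# Minimize a simple analytic function with the OpenMDAO simple GA driver,
# evaluating the population in parallel under MPI (run with e.g.
# "mpirun -n 4 python run_ga.py").
from openmdao.api import Problem, IndepVarComp, ExecComp, SimpleGADriver

prob = Problem()
model = prob.model

indeps = model.add_subsystem("indeps", IndepVarComp(), promotes=["*"])
indeps.add_output("x", 3.0)
indeps.add_output("y", -4.0)

model.add_subsystem(
    "comp", ExecComp("f = (x - 3.0)**2 + x*y + (y + 4.0)**2 - 3.0"), promotes=["*"]
)

model.add_design_var("x", lower=-50.0, upper=50.0)
model.add_design_var("y", lower=-50.0, upper=50.0)
model.add_objective("f")

prob.driver = SimpleGADriver()
prob.driver.options["bits"] = {"x": 8, "y": 8}    # bit-string encoding per variable
prob.driver.options["run_parallel"] = True        # evaluate population members in parallel

prob.setup()
prob.run_driver()
```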

Is it possible to use TensorFlow Serving with distributed TensorFlow cluster to improve throughput/latency?

I'm looking into ways to improve the latency and/or throughput of a TensorFlow Serving instance. I've seen the "Serving Inception" manual and three GitHub issues (2, 3, 4), but all of them seem to create a separate instance of TensorFlow Serving per server and then choose the server on the client. Issue 4 is actually about adding some load balancer in front of that stuff, which is currently absent in TensorFlow Serving itself.
However, there is also the "Distributed TensorFlow" tutorial, which shows how to join a set of machines into a fixed cluster and then manually "pin" some computations to some machines, which can improve both latency and throughput if the model is "wide" and can be parallelized well. I do not see any mention of combining this with TensorFlow Serving in either documentation, however.
Question is: is it possible to configure TensorFlow Serving to use distributed TensorFlow cluster?
I was able to make it create and use gRPC sessions (instead of local) with some hacks:
Make tensorflow/core/distributed_runtime/rpc:grpc_session target publicly visible (it's internal to tensorflow package by default) by modifying BUILD file.
Add it as a dependency to the tensorflow_serving/model_servers:tensorflow_model_server target.
Add an extra flag to tensorflow_model_server called --session_target which sets up session_bundle_config.session_target() in main.cc.
Run the binary with --session_target=grpc://localhost:12345, where localhost:12345 is an arbitrary node which will be used to create master sessions.
See my cluster performing some computations on behalf of TensorFlow Serving.
However, this set of hacks does not look enough for "real-world usage" for three reasons:
grpc_session target is probably internal for a reason.
As noted in my other question, distributed TensorFlow works better when computations are manually "pinned" to specific machines. So, if we use TensorFlow Serving, we need a way to save those "pins", and the model's structure becomes tied to the cluster's structure. I'm not sure whether this information is exported with Exporter/Saver at all.
tensorflow_model_server creates the session once, during bootstrap. If the master node of the cluster goes down and then comes back up, the serving server still holds the "old" session and cannot process further requests.
All in all, it looks like this scenario is not officially supported yet, but I'm not sure.
If your model fits on a single machine, then it's hard to see how distributing it over many machines will improve throughput. Essentially you are taking computations which can be done independently and adding a dependency. If one of your machines is slow or crashes, instead of making some queries slow, it will make all queries slow.
That said, it's worth benchmarking to see if it helps, in which case it would make sense to ask for this use-case to be officially supported.
Regarding your questions:
Worker assignments are done through the device field in the graph .pbtxt. Some importers/exporters clear those assignments and have a clear_devices flag. You could open the graph definition (the .pbtxt file or, equivalently, str(tf.get_default_graph().as_graph_def())) and grep for device strings to check.
If any worker restarts, or there is a temporary loss of network connectivity, your sess.run call fails with an error (Unavailable) and you need to recreate the session. This is handled automatically by MonitoredTrainingSession in tf.train, but with serving you need to handle it yourself.
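As a small illustration of the first point, a quick check (TF 1.x) for explicit device "pins" in a graph definition:

```python
# List which ops in the current graph carry an explicit device assignment,
# i.e. the "pins" that would tie an exported model to a cluster layout.
import tensorflow as tf

graph_def = tf.get_default_graph().as_graph_def()
for node in graph_def.node:
    if node.device:
        print(node.name, "->", node.device)
```

And a rough sketch of handling a master restart yourself on the serving side (TF 1.x; the target string is a placeholder matching whatever --session_target was set to):

```python
# If a sess.run call fails with Unavailable, throw the session away,
# reconnect to the same gRPC target, and retry once.
import tensorflow as tf

target = "grpc://localhost:12345"   # same value as --session_target

sess = tf.Session(target)

def run_with_retry(fetches, feed_dict=None):
    global sess
    try:
        return sess.run(fetches, feed_dict=feed_dict)
    except tf.errors.UnavailableError:
        # Master went away; recreate the session and retry.
        sess.close()
        sess = tf.Session(target)
        return sess.run(fetches, feed_dict=feed_dict)
```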
If your model is not using images, or is not entirely too large, you shouldn't need too much compute for each inference/serve. I'm saying this having used Inception-v#, which takes ~1 sec to serve a response to an image on a Google Cloud Platform n1-standard-1 machine.
Now, that being said, perhaps it's the throughput that you need to scale up, and that is a different problem. Your best option for scale at that point would be to use Docker Swarm & Compose, as well as Kubernetes, to help scale up and serve your inference "micro-service". You could also use Flask to iterate over a sequence of requests if your use case warrants it.
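As a toy illustration of that "micro-service" idea (everything here is a placeholder, in particular predict_fn, which stands in for whatever loads and runs your model):

```python
# Tiny Flask endpoint that wraps model inference behind HTTP, so it can be
# replicated behind a load balancer (Docker Swarm / Kubernetes).
from flask import Flask, request, jsonify

app = Flask(__name__)

def predict_fn(inputs):
    # Placeholder: load your model once at startup and run inference here.
    return {"outputs": inputs}

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(force=True)
    return jsonify(predict_fn(payload))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8501)
```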

How is Docker different from a normal OS process?

There is a question, How is Docker.io different from a normal virtual machine?, where an answer goes into detail describing how lightweight Docker is and how isolated it is. I am trying to understand:
How is Docker different from a regular OS process?
What benefits does it provide on top of a separate OS process?
dotCloud did a series of articles talking about how containers build on OS namespaces and cgroups. They were called "Under the Hood", and while they were focused on the dotCloud PaaS, the general principles apply to all container systems.
So, in the big picture, Docker processes are exactly regular OS processes. They just set some extra parameters (namespaces, cgroups, filesystem mounts) that are normally left at default values.
When you set these parameters to non-default values, you get additional isolation for your new process and more control over the resources it uses.
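You can see this from inside any process. A minimal sketch (Linux only): run the script below on the host and again inside a container, and you will see the same kind of ordinary process, just with different namespace IDs and cgroup membership.

```python
# Inspect the namespaces and cgroups of the current process via /proc.
import os

def show_namespaces():
    for name in sorted(os.listdir("/proc/self/ns")):
        # Each entry is a symlink like "net:[4026531992]"; processes that
        # share a namespace see the same inode number here.
        print(name, "->", os.readlink("/proc/self/ns/" + name))

def show_cgroups():
    with open("/proc/self/cgroup") as f:
        print(f.read().strip())

if __name__ == "__main__":
    show_namespaces()
    show_cgroups()
```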