Can I set workloads in mxnet when using a distributed environment (multiple nodes)?

I want to ask whether I can set different workloads when I use a distributed computing environment with mxnet. I have read some tutorials on distributed GPUs,
but I want to use a distributed-node (CPU) environment and assign a different workload to each node. Can I do that? If yes, can I get some examples of how to do it?
Thank you for your answer!

Yes, it is supported. Check this link, which shows that you can specify work_load_list for the GPUs or CPUs across which you want to distribute your workload:
http://mxnet.io/how_to/multi_devices.html#advanced-usage
Also, you should check the Python API reference (http://mxnet.io/api/python/model.html#mxnet.model.FeedForward): the work_load_list parameter can be set when calling model.FeedForward.fit(...).
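For illustration, here is a minimal sketch of that parameter with the old FeedForward API; the network, data iterator, and the 3:1 split are placeholders you would replace with your own, and the kvstore value assumes you launch the job with mxnet's distributed launcher:

```python
import mxnet as mx

# Toy network and data, purely for illustration.
data = mx.sym.Variable('data')
net = mx.sym.FullyConnected(data=data, num_hidden=10)
net = mx.sym.SoftmaxOutput(data=net, name='softmax')
train_iter = mx.io.NDArrayIter(mx.nd.zeros((96, 20)),
                               mx.nd.zeros((96,)), batch_size=8)

# Two CPU contexts; work_load_list splits each mini-batch 3:1 between them.
model = mx.model.FeedForward(symbol=net, ctx=[mx.cpu(0), mx.cpu(1)], num_epoch=1)
model.fit(X=train_iter,
          kvstore='dist_sync',      # use 'local' when running on one machine
          work_load_list=[3, 1])
```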
Hope this helps!

Related

Passing CPU cores to MultiWorkerMirroredStrategy()

I would like to implement a mirrored strategy using CPUs, but I don't know how to frame the parameters to pass to the strategy. This is the line of code as it stands for GPUs: distribution = tf.contrib.distribute.MultiworkerMirroredStrategy(["/device:GPU:0", "/device:GPU:1", "/device:GPU:2"])
I could change "/device:GPU:0" to "/device:CPU:0", but that seems to only use one core, or does it? How would I check?
TensorFlow can make use of multiple CPU cores out of the box, so you do not need to use a strategy in this case. MultiWorkerMirroredStrategy is only needed if you want to train with multiple machines. Those machines can each have GPU(s) or be CPU-only.
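If you want to see or pin the thread counts yourself in the TF 1.x line that tf.contrib.distribute belongs to, a minimal sketch is to set them on the session config and watch CPU utilisation with a system monitor such as htop while a large op runs:

```python
import tensorflow as tf

# 0 lets TensorFlow choose the thread counts, which normally means using
# all available cores; set explicit values only to restrict parallelism.
config = tf.ConfigProto(intra_op_parallelism_threads=0,  # threads inside one op
                        inter_op_parallelism_threads=0)  # ops running in parallel
with tf.Session(config=config) as sess:
    a = tf.random_normal([2000, 2000])
    b = tf.random_normal([2000, 2000])
    print(sess.run(tf.reduce_sum(tf.matmul(a, b))))  # should load several cores
```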

TensorFlow 2.0 - How to create a worker cluster?

I'm new to TensorFlow, and I want to perform distributed computing/training using different machines.
The tutorial in this link mentions:
In practice, users would create multiple workers on external IP addresses/ports, and set TF_CONFIG on each worker appropriately.
I didn't find anything that tells how to do that.
I did find tutorials that used an older version of TensorFlow, but they did not use TF_CONFIG and I don't see any ClusterSpec used in the example, so I'm very confused.
Turns out the answer was simpler than expected.
Set TF_CONFIG on every machine (the cluster definition is the same everywhere; only the task index differs per worker) and then run the same script on all machines.
The training does not start until all nodes/workers are connected.
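As a minimal sketch, assuming a two-worker setup with placeholder IPs and ports, the environment and training script could look like this (the only per-machine difference is task.index):

```python
import json
import os

# Placeholder hosts; use your machines' external IPs/ports.
# The 'cluster' block is identical everywhere; 'task.index' is 0 on the
# first worker, 1 on the second, and so on.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["10.0.0.1:12345", "10.0.0.2:12345"]},
    "task": {"type": "worker", "index": 0},
})

import tensorflow as tf

strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
    model.compile(optimizer="sgd", loss="mse")
# model.fit(...) blocks until every worker in the cluster has connected.
```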

Optimizing a neural net for running in an embedded system

I am running some code on an embedded system with an extremely limited memory, and even more limited processing power.
I am using TensorFlow for this implementation.
I have never had to work in this kind of environment before.
What are some steps I can take to make sure my implementation/optimization is as efficient as possible?
Some ideas -
- Pruning code -
https://jacobgil.github.io/deeplearning/pruning-deep-learning
- Ensure loops are as minimal as possible (in the big O sense)
- ...
Thanks a lot.
I suggest using TensorFlow Lite.
It will enable you to compress and quantize your model to make it smaller and faster to run.
It also supports leveraging a GPU and/or a hardware accelerator if either is available to you.
https://www.tensorflow.org/lite
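As a rough sketch, assuming a TF 2.x Keras model saved at the placeholder path "model.h5", post-training quantization with the TFLite converter looks roughly like this:

```python
import tensorflow as tf

# "model.h5" is a placeholder for your trained Keras model.
model = tf.keras.models.load_model("model.h5")

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # post-training quantization
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)  # deploy this file with the TFLite interpreter
```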
If you are working with TensorFlow 1.13 (the latest stable version before the 2.0 prototype), there is a pruning function in the tf.contrib submodule. It has a sparsity parameter that you can tune to determine the size of the network.
I suggest you take a look at the whole tf.contrib.model_pruning submodule here. It has plenty of functions you might need for your specific task.
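Roughly, the workflow looks like the sketch below; this follows my reading of the submodule's README, so treat the exact hparam names and helpers as assumptions to verify against the 1.13 docs:

```python
import tensorflow as tf
from tensorflow.contrib.model_pruning.python import pruning
from tensorflow.contrib.model_pruning.python.layers import layers

# Build the network with "masked" layers so the weights can be pruned.
inputs = tf.placeholder(tf.float32, [None, 784])
net = layers.masked_fully_connected(inputs, 256)
logits = layers.masked_fully_connected(net, 10)

# Configure pruning; target_sparsity is the knob mentioned above.
global_step = tf.train.get_or_create_global_step()
hparams = pruning.get_pruning_hparams().parse("target_sparsity=0.9")
p = pruning.Pruning(hparams, global_step=global_step)
mask_update_op = p.conditional_mask_update_op()  # run alongside your train op
```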

Is it possible to use TensorFlow Serving with distributed TensorFlow cluster to improve throughput/latency?

I'm looking into ways to improve latency and/or throughput of a TensorFlow Serving instance. I've seen the "Serving Inception" manual and three GitHub issues (2, 3, 4), but all of them seem to create a separate instance of TensorFlow Serving per server and then choose a server on the client side. Issue 4 is actually about adding a load balancer in front of those instances, which is currently absent in TensorFlow Serving itself.
However, there is also the "Distributed TensorFlow" tutorial, which shows how to join a set of machines into a fixed cluster and then manually "pin" some computations to some machines, which can improve both latency and throughput if the model is "wide" and can be parallelized well. However, I do not see any mention of combining this with TensorFlow Serving in either documentation.
Question is: is it possible to configure TensorFlow Serving to use distributed TensorFlow cluster?
I was able to make it create and use gRPC sessions (instead of local) with some hacks:
- Make the tensorflow/core/distributed_runtime/rpc:grpc_session target publicly visible (it's internal to the tensorflow package by default) by modifying the BUILD file.
- Add it as a dependency to the tensorflow_serving/model_servers:tensorflow_model_server target.
- Add an extra flag to tensorflow_model_server called --session_target which sets up session_bundle_config.session_target() in main.cc.
- Run the binary with --session_target=grpc://localhost:12345, where localhost:12345 is an arbitrary node which will be used to create master sessions.
- See my cluster performing some computations on behalf of TensorFlow Serving.
However, this set of hacks does not look sufficient for "real-world usage", for three reasons:
- The grpc_session target is probably internal for a reason.
- As noticed in my other question, distributed TensorFlow works better when computations are manually "pinned" to specific machines. So, if we use TensorFlow Serving, we need a way to save those "pins", and the model's structure becomes tied to the cluster's structure. I'm not sure whether this information is exported with Exporter/Saver at all.
- tensorflow_model_server creates its session once, during bootstrap. If the master node of the cluster goes down and then comes back, the serving server still holds the "old" session and cannot process further requests.
All in all, it looks like this scenario is not officially supported yet, but I'm not sure.
If your model fits onto a single machine, then it's hard to see how distributing it over many machines will improve throughput. Essentially you are taking computations which can be done independently and adding a dependency. If one of your machines is slow or crashes, instead of making some queries slow, it will make all queries slow.
That said, it's worth benchmarking to see if it helps, in which case it would make sense to ask for this use-case to be officially supported.
Regarding questions:
Worker assignments are done through the device field in the graph .pbtxt. Some importers/exporters clear those assignments and have a clear_devices flag. You could open the graph definition (the .pbtxt file or, equivalently, str(tf.get_default_graph().as_graph_def())) and grep for device strings to check.
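For instance, a quick TF 1.x sketch to list the ops that carry an explicit device assignment:

```python
import tensorflow as tf

# Print every node in the current graph definition that has a device string.
graph_def = tf.get_default_graph().as_graph_def()
for node in graph_def.node:
    if node.device:
        print(node.name, '->', node.device)
```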
If any worker restarts, or there's a temporary loss of network connectivity, your sess.run call fails with an Unavailable error and you need to recreate the session. This is handled automatically by MonitoredTrainingSession in tf.train, but with Serving you need to handle it yourself.
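A minimal sketch of that manual handling, with a placeholder master address, might look like this:

```python
import tensorflow as tf

# Recreate the session when a worker restart surfaces as UnavailableError.
# 'grpc://localhost:12345' is a placeholder for your cluster's master node.
def run_with_retry(fetches, target='grpc://localhost:12345', retries=3):
    sess = tf.Session(target)
    for _ in range(retries):
        try:
            return sess.run(fetches)
        except tf.errors.UnavailableError:
            sess.close()
            sess = tf.Session(target)  # reconnect once the worker is back
    raise RuntimeError('cluster still unavailable after %d retries' % retries)
```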
If your model does not use images, or is not extremely large, you shouldn't need much compute for each inference/serve. I'm saying this based on Inception-v#, which takes ~1 sec to serve a response to an image on a Google Cloud Platform n1-standard-1 machine.
That being said, perhaps it's the throughput that you need to scale up, and that is a different problem. Your best option for scale at that point would be to use Docker Swarm & Compose, as well as Kubernetes, to help scale up and serve your inference "micro-service". You could also use Flask to iterate over a sequence of requests if your use case warrants it.
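If you do go the Flask route, a minimal front-end could look like the sketch below; predict_fn is a placeholder for whatever actually calls your TensorFlow Serving instance (over gRPC or REST):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict_fn(instances):
    # Placeholder: forward `instances` to TensorFlow Serving and return results.
    return [0.0 for _ in instances]

@app.route("/predict", methods=["POST"])
def predict():
    instances = request.get_json()["instances"]
    return jsonify({"predictions": predict_fn(instances)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```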

Using google compute engine for tensorflow project

Google is offering $300 of free trial credit when you register for Google Cloud. I want to use this opportunity to pursue a few projects using TensorFlow. But unlike with AWS, I am not able to find much information on the web about how to configure a Google Compute Engine instance. Can anyone suggest or point me to resources that will help?
I already looked into the Google Cloud documentation; while it is clear, it really doesn't give any suggestions as to what kind of CPUs to use, and for that matter I cannot see any GPU instances when I try to create a VM instance. I want to use something along the lines of an AWS g2.2xlarge instance.
GPUs on Google Cloud are in alpha:
https://cloud.google.com/gpu/
The timeline given for public availability is 2017:
https://cloudplatform.googleblog.com/2016/11/announcing-GPUs-for-Google-Cloud-Platform.html
I would suggest that you think carefully about whether you want to "scale up" (getting a single very powerful machine to do your training) or "scale out" (distributing your training). In many cases, scaling out works out better and cheaper, and TensorFlow/CloudML are set up to help you do that.
Here are directions on how to get Tensorflow going in a Jupyter notebook on a Google Compute Engine VM:
https://codelabs.developers.google.com/codelabs/cpb102-cloudml/#0
The first few steps are TensorFlow, the last steps are Cloud ML.