sagemaker-tensorflow-serving-container - is there a way to configure underlying tensorflow_model_server REST API timeout - tensorflow-serving

TensorFlow Model Server has a command-line option '--rest_api_timeout_in_ms' that controls the timeout for its REST API. I believe the default is 30 seconds. I am serving a (slow) TF model with sagemaker-tensorflow-serving-container (https://github.com/aws/sagemaker-tensorflow-serving-container) and getting timeouts from the underlying tensorflow model server process, which is started by the SageMaker container (see here: https://github.com/aws/sagemaker-tensorflow-serving-container/blob/3952606048615297e5629b2b27dfa6557616b986/docker/build_artifacts/sagemaker/serve.py#L178).
Looking at the sagemaker-tensorflow-serving-container source, I do not see a way to supply this '--rest_api_timeout_in_ms' option :-(.
If anyone has faced this or a similar problem, I would really appreciate any hints or possible workarounds. Thanks!

Related

TensorFlow 2.0 - How to create a worker? Cluster

I'm new to TensorFlow and I want to perform distributed computing/training using different machines.
The tutorial in this link mentions:
In practice, users would create multiple workers on external IP addresses/ports, and set TF_CONFIG on each worker appropriately.
I didn't find anything that tells how to do that.
I did find tutorials that used an old version of TensorFlow, but there was no TF_CONFIG there and I don't see any ClusterSpec used in the example, so I'm very confused.
Turns out the answer was simpler than expected.
Set TF_CONFIG on all machines with the same cluster definition (each machine additionally gets its own task index) and then run the same script on all machines.
The training does not start until all nodes/workers are connected.
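A minimal sketch of what setting TF_CONFIG could look like, assuming a worker-only cluster. The addresses are hypothetical and `make_tf_config` is a helper made up for this illustration. To my understanding, the "cluster" part is identical on every machine and only the task index changes.

```python
import json
import os


def make_tf_config(worker_addresses, task_index):
    """Build the TF_CONFIG JSON string for one worker.

    The "cluster" part is identical on every machine; only the
    "task" index differs from machine to machine.
    """
    return json.dumps({
        "cluster": {"worker": list(worker_addresses)},
        "task": {"type": "worker", "index": task_index},
    })


# Hypothetical addresses: replace with the real host:port of each machine.
WORKERS = ["10.0.0.1:12345", "10.0.0.2:12345"]

# On the first machine use index 0, on the second machine index 1, and so on.
os.environ["TF_CONFIG"] = make_tf_config(WORKERS, task_index=0)
```

The variable has to be set before the training script builds its distribution strategy, so it is usually exported in the shell or set at the very top of the script.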

Optimizing a neural net for running in an embedded system

I am running some code on an embedded system with an extremely limited memory, and even more limited processing power.
I am using TensorFlow for this implementation.
I have never had to work in this kind of environment before.
What are some steps I can take to ensure I am being as efficient as possible in my implementation/optimization?
Some ideas -
- Pruning code -
https://jacobgil.github.io/deeplearning/pruning-deep-learning
- Ensure loops are as minimal as possible (in the big O sense)
- ...
Thanks a lot.
I suggest using TensorFlow Lite.
It will enable you to compress and quantize your model to make it smaller and faster to run.
It also supports leveraging GPU and/or hardware accelerator if any of this is available to you.
https://www.tensorflow.org/lite
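To give an idea of what "quantize" means here, below is a plain-Python sketch of 8-bit affine quantization. It is only an illustration of the general idea under my own assumptions (one scale and zero point for the whole list), not the actual TensorFlow Lite implementation, and the function names are made up.

```python
def quantize_int8(values):
    """Map floats to int8 codes with one scale and zero point (affine quantization)."""
    # Include 0.0 in the range so that zero is always exactly representable.
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / 255.0 or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(-128 - lo / scale)
    codes = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def dequantize_int8(codes, scale, zero_point):
    """Recover approximate floats from the int8 codes."""
    return [(c - zero_point) * scale for c in codes]


weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
codes, scale, zero_point = quantize_int8(weights)
restored = dequantize_int8(codes, scale, zero_point)
```

Each weight now needs one byte instead of four, at the cost of a small rounding error per value.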
If you are working with TensorFlow 1.13 (the latest stable version before the 2.0 prototype), there is a pruning function in the tf.contrib submodule. It has a sparsity parameter that you can tune to determine the size of the network.
I suggest you take a look at the whole tf.contrib.model_pruning submodule here. It has plenty of functions you might need for your specific task.
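As a rough illustration of what a sparsity parameter controls, here is a plain-Python sketch of magnitude pruning. It is not the tf.contrib.model_pruning API; `prune_by_magnitude` is a hypothetical helper that zeroes the given fraction of smallest-magnitude weights.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    num_to_prune = int(len(weights) * sparsity)
    # Indices sorted by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:num_to_prune]:
        pruned[i] = 0.0
    return pruned


layer = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02]
half_sparse = prune_by_magnitude(layer, sparsity=0.5)
# half_sparse == [0.8, 0.0, 0.3, 0.0, -0.6, 0.0]
```

A higher sparsity leaves fewer non-zero weights, which is what makes the stored network smaller once a sparse format or compression is applied.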

Is it possible to use TensorFlow Serving with distributed TensorFlow cluster to improve throughput/latency?

I'm looking into ways to improve latency and/or throughput of a TensorFlow Serving instance. I've seen the "Serving Inception" manual and three GitHub Issues (2, 3, 4), but all of them seem to create a separate instance of TensorFlow Serving per server and then choose a server on the client. Issue 4 is actually about adding some load balancer in front of that stuff, which is currently absent in TensorFlow Serving itself.
However, there is also the "Distributed TensorFlow" tutorial, which shows how to join a set of machines into a fixed cluster and then manually "pin" some computations to some machines, which can improve both latency and throughput if the model is "wide" and can be parallelized well. I do not see any mention of combining this with TensorFlow Serving in either documentation.
The question is: is it possible to configure TensorFlow Serving to use a distributed TensorFlow cluster?
I was able to make it create and use gRPC sessions (instead of local) with some hacks:
- Make the tensorflow/core/distributed_runtime/rpc:grpc_session target publicly visible (it's internal to the tensorflow package by default) by modifying the BUILD file.
- Add it as a dependency to the tensorflow_serving/model_servers:tensorflow_model_server target.
- Add an extra flag to tensorflow_model_server called --session_target which sets up session_bundle_config.session_target() in main.cc.
- Run the binary with --session_target=grpc://localhost:12345, where localhost:12345 is an arbitrary node which will be used to create master sessions.
- See my cluster performing some computations on behalf of TensorFlow Serving.
However, this set of hacks does not look sufficient for "real-world usage", for three reasons:
- The grpc_session target is probably internal for a reason.
- As noticed in my other question, distributed TensorFlow works better when computations are manually "pinned" to specific machines. So, if we use TensorFlow Serving, we need a way to save those "pins", and the model's structure becomes tied to the cluster's structure. I'm not sure whether this information is exported with Exporter/Saver at all.
- tensorflow_model_server creates the session once, during bootstrap. If the master node of the cluster goes down and then comes back, the serving server still holds the "old" session and cannot process further requests.
All in all, it looks like this scenario is not officially supported yet, but I'm not sure.
If your model fits on a single machine, then it's hard to see how distributing it over many machines will improve throughput. Essentially you are taking computations which can be done independently and adding a dependency. If one of your machines is slow or crashes, instead of making some queries slow, it will make all queries slow.
That said, it's worth benchmarking to see if it helps, in which case it would make sense to ask for this use-case to be officially supported.
Regarding questions:
Worker assignments are done through the device field in the graph .pbtxt. Some importers/exporters clear those assignments and have a clear_devices flag. You could open the graph definition (the .pbtxt file or, equivalently, str(tf.get_default_graph().as_graph_def())) and grep for device strings to check.
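The "grep for device strings" check can also be done in a few lines of Python. This is a sketch that assumes the text form of the graph contains lines like device: "/job:worker/task:0"; the sample below is hand-written and the helper name is made up.

```python
import re
from collections import Counter

# Matches the device field of a node in a text-format GraphDef.
DEVICE_RE = re.compile(r'device:\s*"([^"]*)"')


def count_device_assignments(pbtxt_text):
    """Count how many nodes are pinned to each device string."""
    return Counter(DEVICE_RE.findall(pbtxt_text))


# Tiny hand-written sample standing in for a real exported graph.
sample = '''
node { name: "a" op: "Const" device: "/job:worker/task:0" }
node { name: "b" op: "Const" device: "/job:worker/task:1" }
node { name: "c" op: "Add" device: "/job:worker/task:0" }
node { name: "d" op: "Add" }
'''
assignments = count_device_assignments(sample)
```

An empty result on a real export would suggest that the device assignments were cleared somewhere along the way.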
If any worker restarts, or there's a temporary loss of network connectivity, your sess.run fails with an error (Unavailable) and you need to recreate the session. This is handled automatically by MonitoredTrainingSession in tf.train, but you need to handle it yourself with serving.
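Handling this yourself could look roughly like the sketch below. It is framework-free on purpose: `SessionUnavailable` is a hypothetical stand-in for the Unavailable error (in TensorFlow 1.x that would be tf.errors.UnavailableError), and `make_session` / `run_request` are callables you would supply.

```python
import time


class SessionUnavailable(Exception):
    """Hypothetical stand-in for the (Unavailable) error raised by sess.run."""


def run_with_reconnect(make_session, run_request, max_attempts=3, delay_s=0.0):
    """Run a request, rebuilding the session whenever it becomes unavailable."""
    session = make_session()
    for attempt in range(1, max_attempts + 1):
        try:
            return run_request(session)
        except SessionUnavailable:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            time.sleep(delay_s)       # give the worker time to come back
            session = make_session()  # the old session is no longer usable
```

In a serving process the recreated session would also have to replace the one held by the server, which is the part that tensorflow_model_server does not do for you.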
If your model is not using images, or is not overly large, you shouldn't need much compute for each inference/serve. I'm saying this having used Inception-v#, which takes ~1 sec to serve a response to an image on a Google Cloud Platform n1-standard-1 machine.
That being said, perhaps it's the throughput that you need to scale up, and that is a different problem. Your best option for scaling at that point would be to use Docker Swarm & Compose, as well as Kubernetes, to help scale up and serve your inference "micro-service". You could also use Flask to iterate over a sequence of requests if your use case warrants it.

How to run Tensorflow on SLURM cluster with properly configured parameter server?

I am in the fortunate position of having access to my university's SLURM-powered GPU cluster. I have been trying to get TensorFlow to run on a cluster node, but so far I have failed to find any documentation. (Everyone I have spoken to at the university has run it using CPU nodes before or using a single GPU node.)
I found an excellent bit of documentation from this previous question here. Unfortunately, it's rather incomplete. All of the other distributed examples I have found, such as this one, rely on explicitly specifying the parameter server.
When I try to run it using the code from the SO question, it appears to work perfectly until it either fails to connect to a nonexistent parameter server or hangs when server.join is called, and no printouts appear in the sbatch outfile (which I understand should happen).
So in short, my question is: how would one go about starting TensorFlow on a SLURM cluster, from the sbatch stage onwards? This is my first time dealing with a distributed computing framework besides Spark on AWS, and I would love to learn more about how to properly configure TensorFlow. How do I specify which one of the items in the tf_hostlist, for example, serves as the parameter server? Alternatively, can I use sbatch to send slightly different commands to each worker, as I have seen done in other examples?
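One possible way to pick the parameter server from the host list is by position: the first host becomes the parameter server and the rest become workers. The sketch below assumes one process per node and that each process knows its own rank (SLURM exposes it as the SLURM_PROCID environment variable). `assign_roles`, the node names, and the port are made up for illustration.

```python
def assign_roles(tf_hostlist, proc_id, num_ps=1, port=2222):
    """Split a host list into ps/worker jobs and find this process's role.

    The first `num_ps` hosts become parameter servers and the rest become
    workers; `proc_id` is this process's position in the list.
    """
    addresses = ["%s:%d" % (host, port) for host in tf_hostlist]
    cluster = {"ps": addresses[:num_ps], "worker": addresses[num_ps:]}
    if proc_id < num_ps:
        return cluster, "ps", proc_id
    return cluster, "worker", proc_id - num_ps


# Hypothetical node names; on a real job they would come from the SLURM
# node list, and proc_id from the SLURM_PROCID environment variable.
hosts = ["node01", "node02", "node03"]
cluster, job_name, task_index = assign_roles(hosts, proc_id=2)
# job_name == "worker", task_index == 1
```

Every process runs the same script, and the returned job name decides whether it acts as the parameter server or as a worker. Note that a parameter server process normally just calls server.join(), which blocks, so a silent "hang" on that one process may be expected behaviour.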

How can I stream data directly into tensorflow as opposed to reading files on disc?

Every tensorflow tutorial I've been able to find so far works by first loading the training/validation/test images into memory and then processing them. Does anyone have a guide or recommendations for streaming images and labels as input into tensorflow? I have a lot of images stored on a different server and I would like to stream those images into tensorflow as opposed to saving the images directly on my machine.
Thank you!
TensorFlow does have Queues, which support streaming so you don't have to load the full data in memory. But yes, they only support reading from files on the same server by default. The real problem you have is that you want to load data into memory from some other server. I can think of the following ways to do this:
- Expose your images using a REST service. Write your own queueing mechanism in Python, read this data (using urllib or something similar) and feed it to TensorFlow placeholders.
- Instead of using Python queues (as above) you can use TensorFlow queues as well (see this answer), although it's slightly more complicated. The advantage is that TensorFlow queues can use multiple cores, giving you better performance compared to normal Python multi-threaded queues.
- Use a network mount to fool your OS into believing the data is on the same machine.
Also, remember when using this sort of distributed setup, you will always incur network overhead (time taken for images to be transferred from Server 1 to 2), which can slow your training by a lot. To counteract this, you'll have to build a multi-threaded queueing mechanism with fetch-execute overlap, which is a lot of effort. An easier option IMO is to just copy the data into your training machine.
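A bare-bones version of such fetch-execute overlap can be sketched with the standard library: a background thread keeps a bounded queue filled while the training loop consumes batches. `fake_fetch` is a hypothetical stand-in for the real download from the image server, and a production version would also need error handling in the producer thread.

```python
import queue
import threading


def prefetch(fetch_batch, num_batches, buffer_size=4):
    """Yield batches while a background thread fetches the next ones."""
    buffer = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for i in range(num_batches):
            buffer.put(fetch_batch(i))  # blocks when the buffer is full
        buffer.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is done:
            return
        yield item


def fake_fetch(i):
    """Hypothetical stand-in for downloading batch i from the image server."""
    return {"images": [i] * 2, "labels": [i % 2] * 2}


batches = list(prefetch(fake_fetch, num_batches=5))
```

In a training loop you would iterate over `prefetch(...)` directly and feed each batch to the graph, so the next download runs while the current step computes.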
You can use the socket module in Python to transfer a batch of images and labels from your server to your host. Your graph needs to be defined to take a placeholder as input. The placeholder must be compatible with your batch size.
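A minimal sketch of the socket approach, assuming each batch is sent as a length-prefixed JSON message. The framing and the function names are my own choice for illustration; real image data would more likely be sent as raw bytes. The demo uses a local socket pair instead of two machines.

```python
import json
import socket
import struct


def send_batch(sock, images, labels):
    """Send one batch as a length-prefixed JSON message."""
    payload = json.dumps({"images": images, "labels": labels}).encode("utf-8")
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def _recv_exact(sock, size):
    """Read exactly `size` bytes, since recv may return fewer."""
    chunks = []
    while size > 0:
        chunk = sock.recv(size)
        if not chunk:
            raise ConnectionError("socket closed before the batch was complete")
        chunks.append(chunk)
        size -= len(chunk)
    return b"".join(chunks)


def recv_batch(sock):
    """Receive one batch written by send_batch."""
    (length,) = struct.unpack("!I", _recv_exact(sock, 4))
    message = json.loads(_recv_exact(sock, length).decode("utf-8"))
    return message["images"], message["labels"]


# Local demonstration with a connected socket pair instead of two machines.
server_end, host_end = socket.socketpair()
send_batch(server_end, images=[[0.0, 0.5], [1.0, 0.25]], labels=[0, 1])
images, labels = recv_batch(host_end)
server_end.close()
host_end.close()
```

The received `images` and `labels` would then be passed to the graph's placeholders through feed_dict, one batch per training step.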